Abstract

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic $\ell_\infty$-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the $\ell_\infty$-norm of the average accumulated error, as opposed to the $\ell_\infty$-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of the approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform other RL methods by a wide margin.

Keywords: Approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation.



1. Introduction 



Many problems in robotics, operations research and process control can be represented as a control problem that can be solved by finding the optimal policy using dynamic programming (DP). DP is based on estimating some measure of the value of state-action pairs, $Q^*(x,a)$, through the Bellman equation. For high-dimensional discrete systems or for continuous systems, computing the value function by DP is intractable. The common approach to make the computation tractable is to approximate the value function using function approximation and Monte-Carlo sampling (Szepesvári, 2010; Bertsekas and Tsitsiklis, 1996). Examples of such approximate dynamic programming (ADP) methods are approximate policy iteration (API) and approximate value iteration (AVI) (Bertsekas, 2007; Lagoudakis and Parr, 2003; Perkins and Precup, 2002; de Farias and Roy, 2000).

ADP methods have been successfully applied to many real-world problems, and theoretical results have been derived in the form of finite-iteration and asymptotic performance guarantees for the induced policy (Farahmand et al., 2010; Thiery and Scherrer, 2010; Munos, 2005; Bertsekas and Tsitsiklis, 1996). The asymptotic $\ell_\infty$-norm performance-loss bounds of API and AVI are expressed in terms of the supremum, with respect to (w.r.t.) the number of iterations, of the approximation errors:

$$\limsup_{k\to\infty} \|Q^* - Q^{\pi_k}\| \le \frac{2\gamma}{(1-\gamma)^2}\, \limsup_{k\to\infty} \|\epsilon_k\|,$$

where $\gamma$ denotes the discount factor and $\|\cdot\|$ is the $\ell_\infty$-norm w.r.t. the state-action pair $(x,a)$. Also, $\pi_k$ and $\epsilon_k$ are the control policy and the approximation error at round $k$ of the ADP algorithms, respectively. In many problems of interest, however, the supremum over the normed error $\|\epsilon_k\|$ can be large and hard to control due to the large variance of estimation caused by Monte-Carlo sampling. In those cases, a bound which instead depends on the average accumulated error $\bar{\epsilon}_k = \frac{1}{k+1}\sum_{j=0}^{k}\epsilon_j$ is preferable. This is because the errors associated with the variance of estimation can be considered as instances of some zero-mean random variables. Therefore, one can show, by a law of large numbers argument, that those errors are asymptotically averaged out by accumulating the approximation errors of all iterations (see footnote 1).
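The averaging argument can be illustrated with a small numerical sketch. The setup below is an assumption made only for illustration and does not come from the text: every entry of every error vector is drawn i.i.d. from a zero-mean uniform distribution, and the names `error_norms`, `num_pairs` and `scale` are hypothetical. The sketch compares the largest supremum norm of an individual error with the supremum norm of the averaged error.

```python
import random


def sup_norm(vec):
    """Supremum norm of a vector indexed by state-action pairs."""
    return max(abs(v) for v in vec)


def error_norms(num_iters, num_pairs=10, scale=1.0, seed=0):
    """Return (largest ||eps_j||, ||average of eps_0 .. eps_{num_iters-1}||).

    Illustrative assumption: each entry of each eps_j is drawn i.i.d. from
    Uniform(-scale, scale), i.e. zero-mean bounded noise.
    """
    rng = random.Random(seed)
    running_sum = [0.0] * num_pairs
    largest = 0.0
    for _ in range(num_iters):
        eps = [rng.uniform(-scale, scale) for _ in range(num_pairs)]
        largest = max(largest, sup_norm(eps))
        running_sum = [s + e for s, e in zip(running_sum, eps)]
    average = [s / num_iters for s in running_sum]
    return largest, sup_norm(average)


largest_norm, averaged_norm = error_norms(num_iters=10000)
```

Under these assumptions the individual errors stay close to `scale` in norm, while the norm of the average shrinks as the number of iterations grows.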

In this paper, we propose a new mathematically justified approach to estimate the optimal policy, called dynamic policy programming (DPP). We prove finite-iteration and asymptotic performance-loss bounds for the policy induced by DPP in the presence of approximation. The asymptotic bound of approximate DPP is expressed in terms of the average accumulated error $\|\bar{\epsilon}_k\|$, as opposed to $\|\epsilon_k\|$ in the case of AVI and API. This result suggests that DPP may perform better than AVI and API in the presence of large variance of estimation, since it can average out the estimation errors throughout the learning process. The dependency on the average error $\|\bar{\epsilon}_k\|$ follows naturally from the incremental policy update of DPP which, at each round of policy update, unlike AVI and API, accumulates the approximation errors of the previous iterations, rather than just minimizing the approximation error of the current iteration.

This article is organized as follows. In Section 2, we present the notation used in this paper. We introduce DPP and investigate its convergence properties in Section 3. In Section 4, we demonstrate the compatibility of our method with approximation techniques: we generalize the DPP bounds to the case of function approximation and Monte-Carlo simulation, and we introduce a new convergent RL algorithm, called DPP-RL, which relies on an approximate sample-based variant of DPP to estimate the optimal policy. Section 5 presents numerical experiments on several problem domains, including the optimal replacement problem (Munos and Szepesvári, 2008) and a stochastic grid world. In Section 6, we briefly review some related work. Finally, we discuss some of the implications of our work in Section 7.



2. Preliminaries 

In this section, we introduce some concepts and definitions from the theory of Markov decision processes (MDPs) and reinforcement learning (RL), as well as some standard notation (see footnote 2). We begin with the definition of the $\ell_2$-norm (Euclidean norm) and the $\ell_\infty$-norm (supremum norm). Assume that $\mathcal{Y}$ is a finite set. Given the probability measure $\mu$ over $\mathcal{Y}$, for a real-valued function $g : \mathcal{Y} \to \mathbb{R}$, we shall denote the $\ell_2$-norm and the weighted $\ell_{2,\mu}$-norm of $g$ by $\|g\|_2^2 = \sum_{y\in\mathcal{Y}} g(y)^2$ and $\|g\|_{2,\mu}^2 = \sum_{y\in\mathcal{Y}} \mu(y)\, g(y)^2$, respectively. Also, the $\ell_\infty$-norm of $g$ is defined by $\|g\| = \max_{y\in\mathcal{Y}} |g(y)|$.

1. The law of large numbers requires the errors to satisfy some stochastic assumptions, e.g., they need to be independent and identically distributed (i.i.d.) samples or martingale differences.

2. For further reading see Szepesvári (2010).

2.1 Markov Decision Processes 

A discounted MDP is a quintuple $(\mathcal{X}, \mathcal{A}, P, \mathcal{R}, \gamma)$, where $\mathcal{X}$ and $\mathcal{A}$ are, respectively, the state space and the action space. $P$ shall denote the state transition distribution and $\mathcal{R}$ denotes the reward kernel. $\gamma \in [0,1)$ denotes the discount factor. The transition $P$ is a probability kernel over the next state upon taking action $a$ from state $x$, which we shall denote by $P(\cdot|x,a)$. $\mathcal{R}$ is a set of real-valued numbers. A reward $r(x,a) \in \mathcal{R}$ is associated with each state $x$ and action $a$. To keep the representation succinct, we shall denote the joint state-action space $\mathcal{X}\times\mathcal{A}$ by $\mathcal{Z}$.

Assumption 1 (MDP Regularity) We assume $\mathcal{X}$ and $\mathcal{A} = \{a_1, a_2, \ldots, a_L\}$ are finite sets. Also, the absolute value of the immediate reward $r(x,a)$ is bounded from above by $R_{\max} > 0$ for all $(x,a)\in\mathcal{Z}$. We also define $V_{\max} = R_{\max}/(1-\gamma)$.

A policy kernel $\pi(\cdot|\cdot)$ determines the distribution of the control action given the past observations. The policy is called stationary and Markovian if the distribution of the control action is independent of time and only depends on the last state $x$. Given the last state $x$, we shall denote the stationary policy by $\pi(\cdot|x)$. A stationary policy is called deterministic if for any state $x$ there exists some action $a$ such that $\pi(\cdot|x)$ concentrates on this action. Given the policy $\pi$, its corresponding value function $V^\pi : \mathcal{X} \to \mathbb{R}$ denotes the expected value of the long-term discounted sum of rewards in each state $x$ when the action is chosen by policy $\pi$; we denote it by $V^\pi(x)$. Often it is convenient to associate value functions not with states but with state-action pairs. Therefore, we introduce $Q^\pi : \mathcal{Z} \to \mathbb{R}$ as the expected total discounted reward upon choosing action $a$ from state $x$ and then following policy $\pi$, which we shall denote by $Q^\pi(x,a)$. We define the Bellman operator $\mathcal{T}^\pi$ on the action-value functions by:

$$\mathcal{T}^\pi Q(x,a) = r(x,a) + \gamma \sum_{x'\in\mathcal{X}} P(x'|x,a) \sum_{a'\in\mathcal{A}} \pi(a'|x')\, Q(x',a'), \qquad \forall (x,a)\in\mathcal{Z}.$$

We also notice that $Q^\pi$ is the fixed point of $\mathcal{T}^\pi$.

The goal is to find a policy $\pi^*$ that attains the optimal value function, $V^*(x) = \sup_\pi V^\pi(x)$, at all states $x\in\mathcal{X}$. The optimal value function satisfies the Bellman equation:

$$V^*(x) = \sup_\pi \sum_{a\in\mathcal{A}} \pi(a|x)\Big[r(x,a) + \gamma\sum_{x'\in\mathcal{X}} P(x'|x,a)\,V^*(x')\Big], \qquad \forall x\in\mathcal{X}. \tag{1}$$

Likewise, the optimal action-value function $Q^*$ is defined by $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$ for all $(x,a)\in\mathcal{Z}$. We shall define the Bellman optimality operator $\mathcal{T}$ on the action-value functions as:

$$\mathcal{T}Q(x,a) = r(x,a) + \gamma\sum_{x'\in\mathcal{X}} P(x'|x,a)\max_{a'\in\mathcal{A}} Q(x',a'), \qquad \forall(x,a)\in\mathcal{Z}.$$

Likewise, $Q^*$ is the fixed point of $\mathcal{T}$.

Both $\mathcal{T}$ and $\mathcal{T}^\pi$ are contraction mappings, w.r.t. the supremum norm, with the factor $\gamma$ (Bertsekas, 2007, chap. 1). In other words, for any two action-value functions $Q$ and $Q'$, we have:

$$\|\mathcal{T}Q - \mathcal{T}Q'\| \le \gamma\|Q-Q'\|, \qquad \|\mathcal{T}^\pi Q - \mathcal{T}^\pi Q'\| \le \gamma\|Q-Q'\|. \tag{2}$$
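The contraction property of the Bellman optimality operator can be checked numerically. The following is a minimal sketch on a hypothetical 2-state, 2-action MDP whose transition probabilities and rewards are made up for illustration. It applies the operator to random pairs of action-value tables and records the worst observed ratio between the distance after and before the update.

```python
import random

GAMMA = 0.9
# Hypothetical MDP: P[x][a] is a distribution over next states, R[x][a] a reward.
P = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]


def bellman_optimality(Q):
    """(TQ)(x,a) = r(x,a) + gamma * sum_x' P(x'|x,a) * max_a' Q(x',a')."""
    greedy = [max(row) for row in Q]
    return [[R[x][a] + GAMMA * sum(p * g for p, g in zip(P[x][a], greedy))
             for a in range(2)] for x in range(2)]


def sup_dist(Q1, Q2):
    """Supremum-norm distance between two action-value tables."""
    return max(abs(u - v) for r1, r2 in zip(Q1, Q2) for u, v in zip(r1, r2))


rng = random.Random(1)
worst_ratio = 0.0
for _ in range(1000):
    Q1 = [[rng.uniform(-10, 10) for _ in range(2)] for _ in range(2)]
    Q2 = [[rng.uniform(-10, 10) for _ in range(2)] for _ in range(2)]
    ratio = sup_dist(bellman_optimality(Q1), bellman_optimality(Q2)) / sup_dist(Q1, Q2)
    worst_ratio = max(worst_ratio, ratio)
```

If the operator is a $\gamma$-contraction, `worst_ratio` never exceeds `GAMMA`, whatever tables are drawn.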

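The policy distribution $\pi$ defines the state-action transition kernel $P^\pi : \mathcal{M}(\mathcal{Z}) \to \mathcal{M}(\mathcal{Z})$, where $\mathcal{M}$ is the space of all probability measures defined on $\mathcal{Z}$, as:

$$P^\pi(x',a'|x,a) = \pi(a'|x')\,P(x'|x,a).$$

From this kernel a right-linear operator $P^\pi\cdot$ is defined by:

$$(P^\pi Q)(x,a) = \sum_{(x',a')\in\mathcal{Z}} P^\pi(x',a'|x,a)\,Q(x',a'), \qquad \forall(x,a)\in\mathcal{Z}.$$

Further, we define two other right-linear operators $\pi\cdot$ and $P\cdot$ by:

$$(\pi Q)(x) = \sum_{a\in\mathcal{A}} \pi(a|x)\,Q(x,a), \qquad \forall x\in\mathcal{X},$$

$$(PV)(x,a) = \sum_{x'\in\mathcal{X}} P(x'|x,a)\,V(x'), \qquad \forall(x,a)\in\mathcal{Z}.$$

We define the max operator $\mathcal{M}$ on the action-value functions by $(\mathcal{M}Q)(x) = \max_{a\in\mathcal{A}} Q(x,a)$ for all $x\in\mathcal{X}$. Based on the new definitions, one can rephrase the Bellman operator and the Bellman optimality operator as:

$$\mathcal{T}^\pi Q(x,a) = r(x,a) + \gamma(P^\pi Q)(x,a), \qquad \mathcal{T}Q(x,a) = r(x,a) + \gamma(P\mathcal{M}Q)(x,a). \tag{3}$$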
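To make the operator notation concrete, the sketch below builds the operators on a hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions). It checks that composing $P\cdot$ with $\pi\cdot$ gives the same result as summing directly over the joint kernel $P^\pi$.

```python
GAMMA = 0.9
P = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]          # P[x][a][x'], hypothetical numbers
R = [[1.0, 0.0], [0.0, 2.0]]            # r(x, a)
PI = [[0.6, 0.4], [0.25, 0.75]]         # pi(a | x)
Q = [[1.0, -2.0], [0.5, 3.0]]           # an arbitrary action-value table


def pi_op(policy, Q):
    """(pi Q)(x) = sum_a pi(a|x) Q(x,a)."""
    return [sum(p * q for p, q in zip(policy[x], Q[x])) for x in range(len(Q))]


def max_op(Q):
    """(M Q)(x) = max_a Q(x,a)."""
    return [max(row) for row in Q]


def p_op(V):
    """(P V)(x,a) = sum_x' P(x'|x,a) V(x')."""
    return [[sum(p * v for p, v in zip(P[x][a], V)) for a in range(2)]
            for x in range(2)]


def backup(V):
    """r(x,a) + gamma * (P V)(x,a)."""
    PV = p_op(V)
    return [[R[x][a] + GAMMA * PV[x][a] for a in range(2)] for x in range(2)]


T_pi_Q = backup(pi_op(PI, Q))   # Bellman operator for the policy PI
T_Q = backup(max_op(Q))         # Bellman optimality operator

# Same quantity from the joint kernel P^pi(x',a'|x,a) = pi(a'|x') P(x'|x,a).
T_pi_Q_direct = [[R[x][a] + GAMMA * sum(PI[x2][a2] * P[x][a][x2] * Q[x2][a2]
                                        for x2 in range(2) for a2 in range(2))
                  for a in range(2)] for x in range(2)]
```

Since a maximum is never smaller than a weighted average, every entry of `T_Q` is at least the matching entry of `T_pi_Q`.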



3. Dynamic Policy Programming 

In this section, we derive the DPP algorithm starting from the Bellman equation. We first show that by adding a relative entropy term to the reward we can control the deviations of the induced policy from a baseline policy. We then derive an iterative double-loop approach which combines value and policy updates. We reduce this double-loop iteration to just a single iteration by introducing the DPP algorithm. We emphasize that the purpose of the following derivations is to motivate DPP, rather than to provide a formal characterization. Subsequently, in Subsection 3.2 and Section 4, we theoretically investigate the finite-iteration and the asymptotic behavior of DPP and prove its convergence.

3.1 From Bellman Equation to DPP Recursion

Consider the relative entropy between the policy $\pi$ and some baseline policy $\bar\pi$:

$$g_{\bar\pi}^{\pi}(x) = \mathrm{KL}\big(\pi(\cdot|x)\,\|\,\bar\pi(\cdot|x)\big) = \sum_{a\in\mathcal{A}} \pi(a|x)\log\frac{\pi(a|x)}{\bar\pi(a|x)}.$$

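We define a new value function $V_{\bar\pi}^{\pi}$, for all $x\in\mathcal{X}$, which incorporates $g$ as a penalty term for deviating from the base policy $\bar\pi$ and the reward under the policy $\pi$:

$$V_{\bar\pi}^{\pi}(x) = \lim_{n\to\infty} \mathbb{E}\left[\sum_{k=0}^{n} \gamma^k\Big(r_{t+k} - \frac{1}{\eta}\, g_{\bar\pi}^{\pi}(x_{t+k})\Big) \,\Big|\, x_t = x\right],$$

where $\eta$ is a positive constant and $r_{t+k}$ is the reward at time $t+k$. Also, the expected value is taken w.r.t. the state transition probability distribution $P$ and the policy $\pi$. The optimal value function $V_{\bar\pi}^{*}(x) = \sup_\pi V_{\bar\pi}^{\pi}(x)$ then satisfies the following Bellman equation for all $x\in\mathcal{X}$:

$$V_{\bar\pi}^{*}(x) = \sup_\pi \sum_{a\in\mathcal{A}} \pi(a|x)\left[r(x,a) - \frac{1}{\eta}\log\frac{\pi(a|x)}{\bar\pi(a|x)} + \gamma(PV_{\bar\pi}^{*})(x,a)\right]. \tag{4}$$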



Equation (4) is a modified version of (1) where, in addition to maximizing the expected reward, the optimal policy $\bar\pi^*$ also minimizes the distance to the baseline policy $\bar\pi$. The maximization in (4) can be performed in closed form. Following Todorov (2006), we state Proposition 1.

Proposition 1 Let $\eta$ be a positive constant. Then, for all $x\in\mathcal{X}$ the optimal value function $V_{\bar\pi}^{*}(x)$ and for all $(x,a)\in\mathcal{Z}$ the optimal policy $\bar\pi^*(a|x)$, respectively, satisfy:

$$V_{\bar\pi}^{*}(x) = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}} \bar\pi(a|x)\exp\big[\eta\big(r(x,a) + \gamma(PV_{\bar\pi}^{*})(x,a)\big)\big], \tag{5}$$

$$\bar\pi^*(a|x) = \frac{\bar\pi(a|x)\exp\big[\eta\big(r(x,a) + \gamma(PV_{\bar\pi}^{*})(x,a)\big)\big]}{\exp\big(\eta V_{\bar\pi}^{*}(x)\big)}. \tag{6}$$

Proof See Appendix A.
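The closed-form maximization can be sanity-checked at a single state. In the sketch below the vector `c` stands in for the quantities $r(x,a) + \gamma(PV_{\bar\pi}^{*})(x,a)$, which are treated as fixed, hypothetical numbers; the baseline policy and $\eta$ are likewise illustrative assumptions. It evaluates the entropy-penalized objective at the soft-max policy of the form (6) and at randomly drawn policies.

```python
import math
import random

ETA = 2.0
base = [0.5, 0.3, 0.2]      # hypothetical baseline policy at one state
c = [1.0, 0.2, -0.5]        # hypothetical values of r(x,a) + gamma * (P V)(x,a)


def objective(policy):
    """sum_a pi(a) * [c(a) - (1/eta) * log(pi(a) / base(a))]."""
    return sum(p * (ca - math.log(p / b) / ETA)
               for p, b, ca in zip(policy, base, c) if p > 0.0)


# Closed forms corresponding to (5) and (6), restricted to this single state.
weights = [b * math.exp(ETA * ca) for b, ca in zip(base, c)]
soft_value = math.log(sum(weights)) / ETA
soft_policy = [w / sum(weights) for w in weights]

# No randomly drawn policy should beat the closed-form maximizer.
rng = random.Random(0)
best_random = -math.inf
for _ in range(2000):
    raw = [rng.random() + 1e-9 for _ in base]
    candidate = [v / sum(raw) for v in raw]
    best_random = max(best_random, objective(candidate))
```

The objective at `soft_policy` coincides with `soft_value`, and `best_random` stays below it, which is what the closed form predicts for this one-state instance.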



The optimal policy $\bar\pi^*$ is a function of the base policy, the optimal value function $V_{\bar\pi}^{*}$ and the state transition probability $P$. One can first obtain the optimal value function $V_{\bar\pi}^{*}$ through the following fixed-point iteration:

$$V_{\bar\pi}^{k+1}(x) = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}} \bar\pi(a|x)\exp\big[\eta\big(r(x,a) + \gamma(PV_{\bar\pi}^{k})(x,a)\big)\big], \tag{7}$$

and then compute $\bar\pi^*$ using (6). $\bar\pi^*$ maximizes the value function $V_{\bar\pi}^{\pi}$. However, we are not, in principle, interested in quantifying $\bar\pi^*$, but in solving the original MDP problem and computing $\pi^*$. The idea to further improve the policy towards $\pi^*$ is to replace the baseline policy with the newly computed policy of (6). The new policy can be regarded as a new baseline policy, and the process can be repeated. This leads to a double-loop algorithm to find the optimal policy $\pi^*$, where the outer loop and the inner loop consist of a policy update, Equation (6), and a value function update, Equation (7), respectively.

We then take the following steps to derive the final DPP algorithm: (i) We introduce some extra smoothness to the policy update rule by replacing the double-loop algorithm with direct optimization of both the value function and the policy simultaneously, using the following fixed-point iterations:

$$V_{\bar\pi}^{k+1}(x) = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}} \bar\pi_k(a|x)\exp\big[\eta\big(r(x,a) + \gamma(PV_{\bar\pi}^{k})(x,a)\big)\big], \tag{8}$$

$$\bar\pi_{k+1}(a|x) = \frac{\bar\pi_k(a|x)\exp\big[\eta\big(r(x,a) + \gamma(PV_{\bar\pi}^{k})(x,a)\big)\big]}{\exp\big(\eta V_{\bar\pi}^{k+1}(x)\big)}. \tag{9}$$

Further, (ii) we define the action preference function $\Psi_k$ (Sutton and Barto, 1998), for all $(x,a)\in\mathcal{Z}$ and $k\ge 0$, as follows:

$$\Psi_{k+1}(x,a) = \frac{1}{\eta}\log\bar\pi_k(a|x) + r(x,a) + \gamma(PV_{\bar\pi}^{k})(x,a). \tag{10}$$

By comparing (10) with (9) and (8), we deduce:

$$\bar\pi_k(a|x) = \frac{\exp(\eta\Psi_k(x,a))}{\sum_{a'\in\mathcal{A}}\exp(\eta\Psi_k(x,a'))}, \tag{11}$$

$$V_{\bar\pi}^{k}(x) = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}}\exp(\eta\Psi_k(x,a)). \tag{12}$$

Finally, (iii) by plugging (11) and (12) into (10) we derive:

$$\Psi_{k+1}(x,a) = \Psi_k(x,a) - \mathcal{L}_\eta\Psi_k(x) + r(x,a) + \gamma(P\mathcal{L}_\eta\Psi_k)(x,a), \tag{13}$$

with the operator $\mathcal{L}_\eta$ being defined by $\mathcal{L}_\eta\Psi(x) = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}}\exp(\eta\Psi(x,a))$. Equation (13) is one form of the DPP equations. There is a more efficient and analytically more tractable version of the DPP equation, where we replace $\mathcal{L}_\eta$ by the Boltzmann soft-max $\mathcal{M}_\eta$ defined by $\mathcal{M}_\eta\Psi(x) = \sum_{a\in\mathcal{A}}\big[\exp(\eta\Psi(x,a))\,\Psi(x,a)\big/\sum_{a'\in\mathcal{A}}\exp(\eta\Psi(x,a'))\big]$ (see footnote 3). In principle, we can provide formal analysis for both versions. However, the proof is somewhat simpler for the $\mathcal{M}_\eta$ case, which we will consider in the remainder of this paper. By replacing $\mathcal{L}_\eta$ with $\mathcal{M}_\eta$ we deduce the DPP recursion:

$$\begin{aligned}\Psi_{k+1}(x,a) = \mathcal{O}\Psi_k(x,a) &= \Psi_k(x,a) + r(x,a) + \gamma P\mathcal{M}_\eta\Psi_k(x,a) - \mathcal{M}_\eta\Psi_k(x) \\ &= \Psi_k(x,a) + \mathcal{T}^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x), \qquad \forall(x,a)\in\mathcal{Z},\end{aligned} \tag{15}$$

where $\mathcal{O}$ is an operator defined on the action preferences $\Psi_k$ and $\pi_k$ is the soft-max policy associated with $\Psi_k$:

$$\pi_k(a|x) = \frac{\exp(\eta\Psi_k(x,a))}{\sum_{a'\in\mathcal{A}}\exp(\eta\Psi_k(x,a'))}. \tag{16}$$

In Subsection 3.2, we show that this iteration gradually moves the policy towards the greedy optimal policy. Algorithm 1 shows the procedure.

3. Replacing $\mathcal{L}_\eta$ with $\mathcal{M}_\eta$ is motivated by the following relation between these two operators:

$$|\mathcal{L}_\eta\Psi(x) - \mathcal{M}_\eta\Psi(x)| = \frac{1}{\eta}H_\pi(x) \le \frac{\log(L)}{\eta}, \qquad \forall x\in\mathcal{X}, \tag{14}$$

where $H_\pi(x)$ is the entropy of the policy distribution $\pi$ obtained by plugging $\Psi$ into (16). In words, $\mathcal{M}_\eta\Psi(x)$ is close to $\mathcal{L}_\eta\Psi(x)$ up to the constant $\log(L)/\eta$. Also, both $\mathcal{L}_\eta\Psi(x)$ and $\mathcal{M}_\eta\Psi(x)$ converge to $\mathcal{M}\Psi(x)$ when $\eta$ goes to $+\infty$. For the proof of (14) and further reading see MacKay (2003, chap. 31).
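The relation between the log-sum-exp operator and the Boltzmann soft-max can be checked numerically at a single state. The sketch below uses a hypothetical vector of four action preferences. The function names are illustrative, and the maximum is subtracted inside the exponentials purely for numerical stability.

```python
import math


def soft_policy(prefs, eta):
    """Soft-max distribution proportional to exp(eta * prefs[a])."""
    top = max(prefs)
    weights = [math.exp(eta * (p - top)) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def log_sum_exp_op(prefs, eta):
    """Log-sum-exp operator: (1/eta) * log sum_a exp(eta * prefs[a])."""
    top = max(prefs)
    return top + math.log(sum(math.exp(eta * (p - top)) for p in prefs)) / eta


def boltzmann_op(prefs, eta):
    """Boltzmann soft-max: soft-max weighted average of the preferences."""
    return sum(w * p for w, p in zip(soft_policy(prefs, eta), prefs))


def entropy(dist):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


prefs = [1.0, 0.5, -2.0, 0.9]   # hypothetical action preferences at one state
rows = []
for eta in (0.1, 1.0, 10.0, 100.0):
    gap = log_sum_exp_op(prefs, eta) - boltzmann_op(prefs, eta)
    rows.append((eta, gap, entropy(soft_policy(prefs, eta)) / eta))
```

In each row the gap between the two operators matches the entropy of the soft-max policy divided by $\eta$, and both operators approach `max(prefs)` as $\eta$ grows.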

Finally, we would like to emphasize an important difference between DPP and the double-loop algorithm resulting from solving (4). One may notice that the DPP algorithm, regardless of the choice of $\eta$, is always incremental in $\Psi$ and $\pi$, even when $\eta$ goes to $+\infty$, whereas, in the case of the double-loop update of the policy and the value function, the algorithm reduces to standard value iteration for $\eta = +\infty$, which is apparently not incremental in the policy $\pi$. This difference is due to the extra smoothness introduced to the DPP update rule by replacing the double-loop update with a single loop in (8) and (9).



Algorithm 1: (DPP) Dynamic Policy Programming

  Input: Randomized action preferences $\Psi_0(\cdot)$, $\gamma$ and $\eta$
  for $k = 0, 1, 2, \ldots, K-1$ do
    for each $(x,a)\in\mathcal{Z}$ do
      for each $a'\in\mathcal{A}$ do
        $\pi_k(a'|x) := \exp(\eta\Psi_k(x,a')) \,/\, \sum_{a''\in\mathcal{A}}\exp(\eta\Psi_k(x,a''))$;
      end
      $\Psi_{k+1}(x,a) := \Psi_k(x,a) + \mathcal{T}^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x)$;
    end
  end
  for each $(x,a)\in\mathcal{Z}$ do
    $\pi_K(a|x) := \exp(\eta\Psi_K(x,a)) \,/\, \sum_{a'\in\mathcal{A}}\exp(\eta\Psi_K(x,a'))$;
  end
  return $\pi_K$;
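The procedure can be sketched in a few lines of Python. The example below is a minimal illustration, not the authors' implementation. It assumes a hypothetical deterministic 2-state MDP in which action 0 stays, action 1 moves to the other state, and only staying in state 1 pays a reward of 1. The action preferences start at zero rather than at random values, to keep the run reproducible.

```python
import math

GAMMA = 0.9
ETA = 1.0
NEXT = [[0, 1], [1, 0]]          # deterministic next state NEXT[x][a] (hypothetical MDP)
R = [[0.0, 0.0], [1.0, 0.0]]     # reward r(x, a)


def soft_policy(prefs, eta):
    """Soft-max policy of (16), with the max subtracted for numerical stability."""
    top = max(prefs)
    weights = [math.exp(eta * (p - top)) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def dpp(num_iters):
    """Run the exact DPP recursion (15) and return the final soft-max policy."""
    psi = [[0.0, 0.0], [0.0, 0.0]]      # initial action preferences Psi_0
    for _ in range(num_iters):
        policy = [soft_policy(psi[x], ETA) for x in range(2)]
        # (pi_k Psi_k)(x), which equals the Boltzmann soft-max of Psi_k at x
        m = [sum(p * v for p, v in zip(policy[x], psi[x])) for x in range(2)]
        # Psi_{k+1}(x,a) = Psi_k(x,a) + r(x,a) + gamma * m(next state) - m(x)
        psi = [[psi[x][a] + R[x][a] + GAMMA * m[NEXT[x][a]] - m[x]
                for a in range(2)] for x in range(2)]
    return [soft_policy(psi[x], ETA) for x in range(2)]


policy = dpp(num_iters=200)
```

In this toy problem the optimal behavior is to move from state 0 to state 1 and then stay there, so `policy[0][1]` and `policy[1][0]` should both end up close to 1.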



3.2 Performance Guarantee 

We begin by proving a finite-iteration performance guarantee for DPP:

Theorem 1 (The $\ell_\infty$-norm performance loss bound of DPP) Let Assumption 1 hold. Also, assume that $\Psi_0$ is uniformly bounded from above by $V_{\max}$ for all $(x,a)\in\mathcal{Z}$. Then the following inequality holds for the policy induced by DPP at round $k\ge 0$:

$$\|Q^* - Q^{\pi_k}\| \le \frac{2\gamma\Big(4V_{\max} + \frac{\log(L)}{\eta}\Big)}{(1-\gamma)^2(k+1)}.$$

Proof See Appendix B.
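As a worked example, one can evaluate the bound and ask how many rounds are needed for it to fall below a target. The sketch assumes the bound of Theorem 1 has the form $2\gamma(4V_{\max} + \log(L)/\eta)/((1-\gamma)^2(k+1))$. The parameter values are illustrative and the function names are hypothetical.

```python
import math


def dpp_loss_bound(k, gamma, v_max, num_actions, eta):
    """Assumed right-hand side of the Theorem 1 bound at round k."""
    numerator = 2.0 * gamma * (4.0 * v_max + math.log(num_actions) / eta)
    return numerator / ((1.0 - gamma) ** 2 * (k + 1))


def rounds_needed(target, gamma, v_max, num_actions, eta):
    """Smallest round k at which the assumed bound is at most `target`."""
    numerator = 2.0 * gamma * (4.0 * v_max + math.log(num_actions) / eta)
    return max(0, math.ceil(numerator / ((1.0 - gamma) ** 2 * target) - 1))


gamma, r_max = 0.9, 1.0
v_max = r_max / (1.0 - gamma)     # V_max = R_max / (1 - gamma), as in Assumption 1
k_needed = rounds_needed(target=1.0, gamma=gamma, v_max=v_max, num_actions=4, eta=1.0)
```

The bound decays as $1/(k+1)$, so halving the target roughly doubles the number of rounds.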

We can optimize this bound by the choice of $\eta = \infty$, for which the soft-max policy and the soft-max operator $\mathcal{M}_\eta$ are replaced with the greedy policy and the max operator $\mathcal{M}$. As an immediate consequence of Theorem 1, we obtain the following result:

Corollary 2 The following relation holds in the limit:

$$\lim_{k\to+\infty} Q^{\pi_k}(x,a) = Q^*(x,a), \qquad \forall(x,a)\in\mathcal{Z}.$$

In words, the policy induced by DPP asymptotically converges to the optimal policy $\pi^*$. One can also show that, under some mild assumption, there exists a unique limit for the action preferences at infinity.

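Assumption 2 We assume that the MDP has a unique deterministic optimal policy $\pi^*$ given by:

$$\pi^*(a|x) = \begin{cases} 1 & a = a^*(x) \\ 0 & \text{otherwise} \end{cases}, \qquad \forall x\in\mathcal{X},$$

where $a^*(x) = \arg\max_{a\in\mathcal{A}} Q^*(x,a)$.

Theorem 3 Let Assumptions 1 and 2 hold, let $k$ be a non-negative integer, and let $\Psi_k(x,a)$, for all $(x,a)\in\mathcal{Z}$, be the action preference after $k$ iterations of DPP. Then, we have:

$$\lim_{k\to+\infty}\Psi_k(x,a) = \begin{cases} V^*(x) & a = a^*(x) \\ -\infty & \text{otherwise} \end{cases}, \qquad \forall x\in\mathcal{X}.$$

Proof See Appendix C.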



4. Dynamic Policy Programming with Approximation 

Algorithm 1 (DPP) only applies to small problems with a few states and actions. One can generalize the DPP algorithm to problems of practical scale by using function approximation techniques. Also, to compute the optimal policy by DPP, explicit knowledge of the model is required. In many real-world problems this information is not available; instead, it may be possible to simulate the state transition by Monte-Carlo sampling and then approximately estimate the optimal policy using these samples. In this section, we provide results on the performance loss of DPP in the presence of approximation/estimation error. We then compare the $\ell_\infty$-norm performance-loss bounds of DPP with the standard results of AVI and API. Finally, we introduce new approximate algorithms for implementing DPP with Monte-Carlo sampling (DPP-RL) and linear function approximation (SADPP).






Dynamic Policy Programming 



4.1 The $\ell_\infty$-norm performance-loss bounds for approximate DPP

Let us consider a sequence of action preferences $\{\Psi_0, \Psi_1, \Psi_2, \ldots\}$ such that, at round $k$, the action preference $\Psi_{k+1}$ is the result of approximately applying the DPP operator by means of function approximation or Monte-Carlo simulation, i.e., for all $(x,a)\in\mathcal{Z}$: $\Psi_{k+1}(x,a) \approx \mathcal{O}\Psi_k(x,a)$. The error term $\epsilon_k$ is defined as the difference of $\mathcal{O}\Psi_k$ and its approximation:

$$\epsilon_k(x,a) = \Psi_{k+1}(x,a) - \mathcal{O}\Psi_k(x,a), \qquad \forall(x,a)\in\mathcal{Z}. \tag{17}$$

The approximate DPP update rule is then given by:

$$\begin{aligned}\Psi_{k+1}(x,a) &= \Psi_k(x,a) + r(x,a) + \gamma P\mathcal{M}_\eta\Psi_k(x,a) - \mathcal{M}_\eta\Psi_k(x) + \epsilon_k(x,a) \\ &= \Psi_k(x,a) + \mathcal{T}^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x) + \epsilon_k(x,a),\end{aligned} \tag{18}$$

where $\pi_k$ is given by (16).

We begin with the finite-iteration analysis of approximate DPP. The following theorem establishes an upper bound on the performance loss of DPP in the presence of approximation error. The proof is based on a generalization of the bound that we established for DPP, taking into account the error $\epsilon_k$:

Theorem 4 ($\ell_\infty$-norm performance loss bound of approximate DPP) Let Assumption 1 hold. Assume that $k$ is a non-negative integer and $\Psi_0$ is bounded by $V_{\max}$. Further, define $\epsilon_k$ for all $k$ by (17) and the accumulated error $E_k$ as:

$$E_k(x,a) = \sum_{j=0}^{k}\epsilon_j(x,a), \qquad \forall(x,a)\in\mathcal{Z}. \tag{19}$$

Then the following inequality holds for the policy induced by approximate DPP at round $k$:

$$\|Q^*-Q^{\pi_k}\| \le \frac{1}{(1-\gamma)(k+1)}\left[\frac{2\gamma\Big(4V_{\max}+\frac{\log(L)}{\eta}\Big)}{(1-\gamma)} + \sum_{j=0}^{k}\gamma^{k-j}\|E_j\|\right].$$

Proof See Appendix D.



Taking the upper limit yields the following corollary of Theorem 4.

Corollary 5 (Asymptotic $\ell_\infty$-norm performance-loss bound of approximate DPP) Define $\bar\epsilon = \limsup_{k\to\infty}\|E_k\|/(k+1)$. Then, the following inequality holds:

$$\limsup_{k\to\infty}\|Q^*-Q^{\pi_k}\| \le \frac{2\gamma}{(1-\gamma)^2}\,\bar\epsilon. \tag{20}$$

The asymptotic bound is similar to the existing results for AVI and API (Thiery and Scherrer, 2010; Bertsekas and Tsitsiklis, 1996, chap. 6):

$$\limsup_{k\to\infty}\|Q^*-Q^{\pi_k}\| \le \frac{2\gamma}{(1-\gamma)^2}\,\epsilon_{\max},$$

where $\epsilon_{\max} = \limsup_{k\to\infty}\|\epsilon_k\|$. The difference is that in (20) the supremum norm of the error $\epsilon_{\max}$ is replaced by the supremum norm of the average error $\bar\epsilon$. In other words, unlike AVI and API, the size of the error at each iteration is not a critical factor for the performance of DPP: as long as the size of the average error remains close to 0, DPP can achieve near-optimal performance even when the error itself is arbitrarily large. To gain a better understanding of this result, consider a case in which, for any algorithm, the errors in the sequence $\{\epsilon_0,\epsilon_1,\epsilon_2,\ldots\}$ are i.i.d. zero-mean random variables bounded by $0 < U < \infty$. We then obtain the following asymptotic bound for approximate DPP by applying the law of large numbers to Corollary 5:



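$$\limsup_{k\to\infty}\|Q^*-Q^{\pi_k}\| \le \frac{2\gamma}{(1-\gamma)^2}\,\bar\epsilon = 0, \qquad \text{w.p. (with probability) 1}, \tag{21}$$

whilst for API and AVI we have:

$$\limsup_{k\to\infty}\|Q^*-Q^{\pi_k}\| \le \frac{2\gamma}{(1-\gamma)^2}\,U.$$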

In words, approximate DPP manages to cancel the i.i.d. noise and asymptotically converges to the optimal policy, whereas there is no guarantee, in this case, of the convergence of API and AVI to the optimal solution. This result suggests that DPP can average out the simulation noise caused by Monte-Carlo sampling and eventually achieve significantly better performance than AVI and API in the presence of large variance of estimation. We will show in the next subsection that a sampling-based variant of DPP (DPP-RL) manages to cancel the simulation noise and asymptotically converges, almost surely, to the optimal policy (see Theorem 6).

4.2 Reinforcement Learning with Dynamic Policy Programming 

To compute the optimal policy by DPP, one needs explicit knowledge of the model. In many problems we do not have access to this information, but instead we can generate samples by simulating the model. The optimal policy can then be learned using these samples. In this section, we introduce a new RL algorithm, called DPP-RL, which relies on a sampling-based variant of DPP to update the policy. The update rule of DPP-RL is very similar to (15). The only difference is that, in DPP-RL, we replace the Bellman operator $\mathcal{T}^{\pi_k}\Psi_k(x,a)$ with its sample estimate $\mathcal{T}_k^{\pi_k}\Psi_k(x,a) = r(x,a) + \gamma\,\pi_k\Psi_k(y_k)$, where the next sample $y_k$ is drawn from $P(\cdot|x,a)$ (see footnote 4):

$$\Psi_{k+1}(x,a) = \Psi_k(x,a) + \mathcal{T}_k^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x). \tag{22}$$



4. We assume, hereafter, that we have access to the generative model of the MDP, i.e., given the state-action pair $(x,a)$ we can generate the next sample $y$ from $P(\cdot|x,a)$ for all $(x,a)\in\mathcal{Z}$.

The pseudo-code of the DPP-RL algorithm is shown in Algorithm 2.

Algorithm 2: (DPP-RL) Reinforcement learning with DPP

  Input: Initial action preferences $\Psi_0(\cdot)$, discount factor $\gamma$, $\eta$ and number of iterations $K$
  Generate a set of i.i.d. samples $\{y_1, y_2, y_3, \ldots, y_K\}$, for every $(x,a)\in\mathcal{Z}$, from $P(\cdot|x,a)$;
  for $k := 0, 1, 2, 3, \ldots, K-1$ do
    for each $(x,a)\in\mathcal{Z}$ do
      for each $a'\in\mathcal{A}$ do
        $\pi_k(a'|y_k) := \exp(\eta\Psi_k(y_k,a')) \,/\, \sum_{a''\in\mathcal{A}}\exp(\eta\Psi_k(y_k,a''))$;
      end
      $\mathcal{T}_k^{\pi_k}\Psi_k(x,a) := r(x,a) + \gamma\,\pi_k\Psi_k(y_k)$;
      $\Psi_{k+1}(x,a) := \Psi_k(x,a) + \mathcal{T}_k^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x)$;
    end
  end
  $\pi_K(a|x) := \exp(\eta\Psi_K(x,a)) \,/\, \sum_{a'\in\mathcal{A}}\exp(\eta\Psi_K(x,a'))$, for all $(x,a)\in\mathcal{Z}$;
  return $\pi_K$
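A sampling-based update of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The generative model `sample_next` describes a hypothetical 2-state MDP in which action `a` reaches state `a` with probability 0.9, and the reward depends only on the current state. The seed and iteration count are arbitrary.

```python
import math
import random

GAMMA = 0.9
ETA = 1.0
R = [0.0, 1.0]      # hypothetical reward, depends on the current state only


def sample_next(x, a, rng):
    """Hypothetical generative model: reach state a w.p. 0.9, the other state otherwise."""
    return a if rng.random() < 0.9 else 1 - a


def soft_policy(prefs, eta):
    """Soft-max policy, with the max subtracted for numerical stability."""
    top = max(prefs)
    weights = [math.exp(eta * (p - top)) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def dpp_rl(num_iters, seed=0):
    """Run the sampled update (22) and return the final soft-max policy."""
    rng = random.Random(seed)
    psi = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_iters):
        policy = [soft_policy(psi[x], ETA) for x in range(2)]
        m = [sum(p * v for p, v in zip(policy[x], psi[x])) for x in range(2)]
        new_psi = [[0.0, 0.0], [0.0, 0.0]]
        for x in range(2):
            for a in range(2):
                y = sample_next(x, a, rng)             # one sample from P(.|x,a)
                sampled_backup = R[x] + GAMMA * m[y]   # sample estimate of the Bellman backup
                new_psi[x][a] = psi[x][a] + sampled_backup - m[x]
        psi = new_psi
    return [soft_policy(psi[x], ETA) for x in range(2)]


policy = dpp_rl(num_iters=2000)
```

Note that the update uses no learning-rate schedule. In this toy problem heading for state 1 is optimal from both states, so the returned policy should put most of its mass on action 1.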



Equation (22) is just an approximation of the DPP update rule (15). Therefore, the convergence result of Corollary 2 does not hold for DPP-RL. However, the new algorithm still converges to the optimal policy, since one can show that the errors associated with approximating (15) are asymptotically averaged out by DPP-RL, as postulated by Corollary 5. The following theorem establishes the asymptotic convergence of the policy induced by DPP-RL to the optimal policy.

Theorem 6 (Asymptotic convergence of DPP-RL) Let Assumption 1 hold. Assume that the initial action-value function $\Psi_0$ is uniformly bounded by $V_{\max}$ and $\pi_k$ is the policy induced by Algorithm 2 at round $k$. Then, w.p. 1, the following holds:

$$\lim_{k\to\infty} Q^{\pi_k}(x,a) = Q^*(x,a), \qquad \forall(x,a)\in\mathcal{Z}.$$

Proof See Appendix E.



One may notice that the update rule of DPP-RL, unlike other incremental RL methods such as Q-learning (Watkins and Dayan, 1992) and SARSA (Singh et al., 2000), does not involve any decaying learning step. This is an important difference, since it is known that the convergence rate of incremental RL methods like Q-learning is very sensitive to the choice of learning step (Even-Dar and Mansour, 2003; Szepesvári, 1997), and a bad choice of the learning step may lead to a significantly slower rate of convergence. DPP-RL does not seem to suffer from this problem (see Figure 3), since the DPP-RL update rule is just an empirical estimate of the update rule of DPP. Therefore, one may expect that the rate of convergence of DPP-RL remains close to the fast rate of convergence of DPP established in Theorem 1.

4.3 Approximate Dynamic Policy Programming with Linear Function Approximation

In this subsection, we consider DPP with linear function approximation (LFA) and least-squares regression. Given a set of basis functions $\mathcal{F}_\phi = \{\phi_1,\ldots,\phi_m\}$, where each $\phi_i : \mathcal{Z}\to\mathbb{R}$ is a bounded real-valued function, the sequence of action preferences $\{\Psi_0,\Psi_1,\Psi_2,\ldots\}$ is defined as a linear combination of these basis functions: $\Psi_k = \theta_k^{\mathsf T}\Phi$, where $\Phi$ is an $m\times 1$ column vector with the entries $\{\phi_i\}_{i=1:m}$ and $\theta_k\in\mathbb{R}^m$ is an $m\times 1$ vector of parameters.

The action preference function $\Psi_{k+1}$ is an approximation of the DPP operator $\mathcal{O}\Psi_k$. In the case of LFA, the common approach to approximate the DPP operator is to find a vector $\theta_{k+1}$ that projects $\mathcal{O}\Psi_k$ on the column space spanned by $\Phi$ by minimizing the loss function:

$$J_k(\theta;\Psi) = \big\|\theta^{\mathsf T}\Phi - \mathcal{O}\Psi_k\big\|_{2,\mu}^2, \tag{23}$$

where $\mu$ is a probability measure on $\mathcal{Z}$. The best solution, which minimizes $J_k$, is called the least-squares solution:

$$\theta_{k+1} = \arg\min_\theta J_k(\theta;\Psi) = \big[\mathbb{E}(\Phi\Phi^{\mathsf T})\big]^{-1}\,\mathbb{E}(\Phi\,\mathcal{O}\Psi_k), \tag{24}$$


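where the expectation is taken w.r.t. $(x,a)\sim\mu$. In principle, computing the least-squares solution requires computing $\mathcal{O}\Psi_k$ for all states and actions. For large-scale problems this becomes infeasible. Instead, we can make a sample estimate of the least-squares solution by minimizing the empirical loss $\tilde J_k(\theta;\Psi)$:

$$\tilde J_k(\theta;\Psi) = \frac{1}{N}\sum_{n=1}^{N}\big(\theta^{\mathsf T}\Phi(X_n,A_n) - \mathcal{O}_n\Psi_k\big)^2 + \alpha\,\theta^{\mathsf T}\theta,$$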

where $\{(X_n,A_n)\}_{n=1:N}$ is a set of $N$ i.i.d. samples drawn from the distribution $\mu$. Also, $\mathcal{O}_n\Psi_k$ denotes a single-sample estimate of $\mathcal{O}\Psi_k(X_n,A_n)$ defined by $\mathcal{O}_n\Psi_k = \Psi_k(X_n,A_n) + r(X_n,A_n) + \gamma\mathcal{M}_\eta\Psi_k(X'_n) - \mathcal{M}_\eta\Psi_k(X_n)$, where $X'_n\sim P(\cdot|X_n,A_n)$. Further, to avoid overfitting due to the small size of the data set, we add a quadratic regularization term to the loss function. The empirical least-squares solution which minimizes $\tilde J_k(\theta;\Psi)$ is given by:

$$\tilde\theta_{k+1} = \left[\sum_{n=1}^{N}\Phi(X_n,A_n)\Phi(X_n,A_n)^{\mathsf T} + \alpha N I\right]^{-1}\sum_{n=1}^{N}\mathcal{O}_n\Psi_k\,\Phi(X_n,A_n). \tag{25}$$

Algorithm 3 presents the sampling-based approximate dynamic policy programming (SADPP), in which we rely on (25) to approximate the DPP operator at each iteration.

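Algorithm 3: (SADPP) Sampling-based approximate dynamic policy programming

  Input: $\tilde\theta_0$, $\eta$, $\gamma$, $\alpha$, $K$ and $N$
  for $k = 0, 1, 2, \ldots, K-1$ do
    Generate a set of i.i.d. samples $\{(X_n, A_n, X'_n)\}_{n=1:N}$ by drawing $N$ samples from $\mu$ and $P(\cdot|X_n,A_n)$;
    for $n = 1, 2, 3, \ldots, N$ do
      $\Psi_k(X_n,A_n) = \tilde\theta_k^{\mathsf T}\Phi(X_n,A_n)$;
      for each $A'\in\mathcal{A}$ do
        $\Psi_k(X_n,A') = \tilde\theta_k^{\mathsf T}\Phi(X_n,A')$;
        $\Psi_k(X'_n,A') = \tilde\theta_k^{\mathsf T}\Phi(X'_n,A')$;
      end
      $\mathcal{M}_\eta\Psi_k(X_n) = \sum_{A'\in\mathcal{A}} \exp(\eta\Psi_k(X_n,A'))\,\Psi_k(X_n,A') \,/\, \sum_{A''\in\mathcal{A}}\exp(\eta\Psi_k(X_n,A''))$;
      $\mathcal{M}_\eta\Psi_k(X'_n) = \sum_{A'\in\mathcal{A}} \exp(\eta\Psi_k(X'_n,A'))\,\Psi_k(X'_n,A') \,/\, \sum_{A''\in\mathcal{A}}\exp(\eta\Psi_k(X'_n,A''))$;
      $\mathcal{O}_n\Psi_k = \Psi_k(X_n,A_n) + r(X_n,A_n) + \gamma\mathcal{M}_\eta\Psi_k(X'_n) - \mathcal{M}_\eta\Psi_k(X_n)$;
    end
    $\tilde\theta_{k+1} = \big[\sum_{n=1}^{N}\Phi(X_n,A_n)\Phi(X_n,A_n)^{\mathsf T} + \alpha N I\big]^{-1}\sum_{n=1}^{N}\mathcal{O}_n\Psi_k\,\Phi(X_n,A_n)$;
  end
  return $\tilde\theta_K$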
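The regularized least-squares step can be illustrated in isolation. The sketch below assumes two-dimensional feature vectors, so that the linear system can be solved with Cramer's rule and no external library. The data are hypothetical, and the targets stand in for the sampled values $\mathcal{O}_n\Psi_k$.

```python
def ridge_solution(features, targets, alpha):
    """Solve [sum_n phi_n phi_n^T + alpha*N*I] theta = sum_n target_n * phi_n
    for two-dimensional feature vectors, using Cramer's rule."""
    n = len(features)
    a11 = sum(f[0] * f[0] for f in features) + alpha * n
    a12 = sum(f[0] * f[1] for f in features)
    a22 = sum(f[1] * f[1] for f in features) + alpha * n
    b1 = sum(t * f[0] for f, t in zip(features, targets))
    b2 = sum(t * f[1] for f, t in zip(features, targets))
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Hypothetical data: feature vectors and targets that are exactly linear in them.
features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
true_theta = [0.5, -2.0]
targets = [true_theta[0] * f[0] + true_theta[1] * f[1] for f in features]

theta_plain = ridge_solution(features, targets, alpha=0.0)   # no regularization
theta_ridge = ridge_solution(features, targets, alpha=1.0)   # shrunk towards zero
```

With `alpha=0.0` and noiseless targets the parameters are recovered exactly, while a positive `alpha` shrinks the solution towards zero, which is the intended effect of the quadratic regularization term.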



5. Numerical Results 



In this section, we empirically analyze the effectiveness of the proposed algorithms on different problem domains. We first examine the convergence properties of DPP-RL (Algorithm 2) on several discrete state-action problems and compare it with two standard algorithms: a synchronous variant of Q-learning (Even-Dar and Mansour, 2003) (QL) and the model-based Q-value iteration (VI) of Kearns and Singh (1999). Next, we investigate the finite-time performance of SADPP (Algorithm 3) in the presence of function approximation and a limited sampling budget per iteration. In this case, we consider a variant of the optimal replacement problem described in Munos and Szepesvári (2008) and compare our method with regularized least-squares fitted Q-iteration (RFQI) (Farahmand et al., 2008).

The source code of all tested algorithms is freely available at http://www.mbfys.ru.nl/~iiiazar/Research_To]



5.1 DPP-RL 

We consider the following large-scale MDPs as benchmark problems:

Linear MDP: this problem consists of states $x_k\in\mathcal{X}$, $k\in\{1,2,\ldots,2500\}$, arranged in a one-dimensional chain (see Figure 1). There are two possible actions $\mathcal{A} = \{-1,+1\}$ (left/right) and every state is accessible from any other state except for the two ends of the chain, which are absorbing states. A state $x_k\in\mathcal{X}$ is called absorbing if $P(x_k|x_k,a) = 1$ for all $a\in\mathcal{A}$ and $P(x_l|x_k,a) = 0$, $\forall l\ne k$. Any transition to one of these two states has associated reward 1.

Figure 1: Linear MDP: Illustration of the linear MDP problem. Nodes indicate states. States $x_1$ and $x_{2500}$ are the two absorbing states and state $x_k$ is an example of an interior state. Arrows indicate possible transitions of these three nodes only. From $x_k$ any other node is reachable with transition probability (arrow thickness) proportional to the inverse of the distance to $x_k$ (see the text for details).

The transition probability for an interior state $x_k$ to any other state $x_l$ is inversely proportional to their distance in the direction of the selected action, and zero for all states corresponding to the opposite direction. Formally, consider the following quantity $n(x_l,a,x_k)$ assigned to all non-absorbing states $x_k$ and to every $(x_l,a)\in\mathcal{Z}$:

$$n(x_l,a,x_k) = \begin{cases}\dfrac{1}{|l-k|} & \text{for } (l-k)a > 0 \\ 0 & \text{otherwise.}\end{cases}$$

We can write the transition probabilities as:

$$P(x_l|x_k,a) = \frac{n(x_l,a,x_k)}{\sum_{x_m\in\mathcal{X}} n(x_m,a,x_k)}.$$

Any transition that ends up in one of the interior states has associated reward $-1$.

The optimal policy corresponding to this problem is to reach the closest absorbing state as soon as possible.
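The transition model of the linear MDP can be written down directly. The sketch below assumes that the unnormalized weight is $1/|l-k|$ in the direction of the action and zero otherwise, which is how the description above is read here. The function name and the choice of a dictionary as return type are illustrative.

```python
def linear_mdp_transition(k, action, num_states=2500):
    """Return {l: P(x_l | x_k, action)} for an interior state x_k.

    States are numbered 1..num_states and action is -1 (left) or +1 (right).
    Assumed weights: 1/|l-k| for states in the direction of the action.
    """
    if action == 1:
        reachable = range(k + 1, num_states + 1)
    else:
        reachable = range(1, k)
    weights = {l: 1.0 / abs(l - k) for l in reachable}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}


probs = linear_mdp_transition(k=1200, action=+1)
```

Only states to the right of $x_{1200}$ receive probability mass here, and nearer states are more likely than distant ones.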

Combination lock: the combination lock problem considered here is a stochastic variant of the reset state space models introduced in Koenig and Simmons (1993), where more than one reset state is possible (see Figure 2).

In our case we consider, as before, a set of states $x_k\in\mathcal{X}$, $k\in\{1,2,\ldots,2500\}$, arranged in a one-dimensional chain and two possible actions $\mathcal{A}=\{-1,+1\}$. In this problem, however, there is only one absorbing state (corresponding to the state lock-opened) with associated reward of 1. This state is reached if the all-ones sequence $\{+1,+1,\ldots,+1\}$ is entered correctly. Otherwise, if at some state $x_k$, $k<2500$, action $-1$ is taken, the lock automatically resets to some previous state $x_l$, $l<k$, chosen randomly (in the original combination lock problem, the reset state is always the initial state $x_1$).






Figure 2: Combination lock: illustration of the combination lock MDP problem. Nodes
indicate states. State $x_{2500}$ is the goal (absorbing) state and state $x_k$ is an example
of an interior state. Arrows indicate possible transitions of these two nodes
only. From $x_k$ any previous state is reachable with transition probability (arrow
thickness) proportional to the inverse of the distance to $x_k$. Among the future
states only $x_{k+1}$ is reachable (arrow dashed).



For every intermediate state, the rewards of actions $-1$ and $+1$ are set to 0 and $-0.01$,
respectively. The transition probability upon taking the wrong action $-1$ is, as before,
inversely proportional to the distance of the states. That is,

$$n(x_k,x_l) = \begin{cases} \dfrac{1}{k-l} & \text{for } l < k, \\ 0 & \text{otherwise,} \end{cases} \qquad P(x_l|x_k,-1) = \frac{n(x_k,x_l)}{\sum_{m} n(x_k,x_m)}.$$

Note that this problem is more difficult than the linear MDP since the goal state is
only reachable from one state, $x_{2499}$.

Grid world: this MDP consists of a grid of $50 \times 50$ states. A set of four actions {RIGHT,
UP, DOWN, LEFT} is assigned to every state $x \in \mathcal{X}$. The location of each state $x$
of the grid is determined by the coordinates $c_x = (h_x, v_x)$, where $h_x$ and $v_x$ are some
integers between 1 and 50. There are 196 absorbing firewall states surrounding the
grid and another one at the center of the grid, for which a reward $-1$ is assigned. The
reward for the firewalls is

$$r(x,a) = -\frac{1}{\|c_x\|_2}, \quad \forall a \in \mathcal{A}.$$

Also, we assign reward 0 to all of the remaining (non-absorbing) states.

This means that both the top-left absorbing state and the central state have the least 
possible reward (—1), and that the remaining absorbing states have reward which 
increases proportionally to the distance to the state in the bottom-right corner (but 
are always negative). 

The transition probabilities are defined in the following way: taking action $a$ from
any non-absorbing state $x$ results in a one-step transition in the direction of action $a$
with probability 0.6, and a random move to a state $y \neq x$ with probability inversely
proportional to their Euclidean distance, $1/\|c_x - c_y\|_2$.

The optimal policy then is to survive in the grid as long as possible by avoiding
both the absorbing firewalls and the center of the grid. Note that because of the
difference between the cost of firewalls, the optimal control prefers the states near the
bottom-right corner of the grid, thus avoiding absorbing states with higher cost.

5.1.1 Experimental Setup and Results 

We now describe our experimental setting. The convergence properties of DPP-RL are compared
with two other algorithms: a synchronous variant of Q-learning (Even-Dar and Mansour,
2003) (QL), which, like DPP-RL, updates the action-value function of all state-action pairs
at each iteration, and the model-based Q-value iteration (VI) of Kearns and Singh (1999).
VI is a batch reinforcement learning algorithm that first estimates the model using the
whole data set and then performs value iteration on the learned model.

All algorithms are evaluated in terms of the $\ell_\infty$-norm performance loss of the action-value
function $\|Q^* - Q^{\pi_k}\|$ obtained by the policy $\pi_k$ induced at iteration $k$. We choose this
performance measure in order to be consistent with the performance measure used in Section 4.
The discount factor $\gamma$ is fixed to 0.995 and the optimal action-value function $Q^*$ is computed
with high accuracy through value iteration.

We consider QL with polynomial learning step $\alpha_k = 1/(k+1)^\omega$ where $\omega \in \{0.51, 0.75\}$
and the linear learning step $\alpha_k = 1/(k+1)$. Note that $\omega$ needs to be larger than 0.5,
otherwise QL can asymptotically diverge (see Even-Dar and Mansour, 2003, for the proof).



To achieve the best rate of convergence for DPP-RL, we fix $\eta$ to $+\infty$ (see Section 3.2).
This replaces the soft-max operator $\mathcal{M}_\eta$ in the DPP-RL update rule with the max operator
$\mathcal{M}$, resulting in a greedy policy $\pi_k$.
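
To make the effect of $\eta$ concrete, the sketch below evaluates the soft-max operator in the form used in Lemma 11, $\mathcal{M}_\eta f = \sum_y \exp(\eta f(y)) f(y) / \sum_{y'} \exp(\eta f(y'))$, for increasing $\eta$ and compares it with the max operator. The vector of action preferences is made up for the example, and subtracting the maximum before exponentiating is a standard numerical safeguard rather than something stated in the paper.

```python
import math


def softmax_operator(values, eta):
    """M_eta f = sum_y exp(eta f(y)) f(y) / sum_y exp(eta f(y))."""
    top = max(values)
    # Shifting by the maximum leaves the ratio unchanged and avoids overflow.
    weights = [math.exp(eta * (v - top)) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


psi = [0.3, -1.2, 0.9, 0.85]  # hypothetical action preferences for one state
etas = (0.1, 1.0, 10.0, 1000.0)
# Gap between the max operator and the soft-max operator for each eta.
gaps = {eta: max(psi) - softmax_operator(psi, eta) for eta in etas}
```

In this example the gap stays below $\log(L)/\eta$ with $L = 4$, as Lemma 11 states, and it vanishes as $\eta$ grows, which is the sense in which $\eta = +\infty$ yields the greedy policy.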

To have a fair comparison of the three algorithms, since each algorithm requires a different
number of computations per iteration, we fix the total computational budget of the
algorithms to the same value for each benchmark. The computation time is constrained
to 30 seconds in the case of linear MDP and the combination lock problems. For the grid 
world, which has twice as many actions as the other benchmarks, the maximum run time 
is fixed to 60 seconds. We also fix the total number of samples, per state-action, to 1 x 10^ 
samples for all problems and algorithms. A significantly smaller number of samples leads to a
dramatic decrease in the quality of the solutions obtained by all the approaches.

Algorithms were implemented as MEX files (in C++) and ran on an Intel Core i5 processor
with 8 GB of memory. CPU time was acquired using the system function times(), which provides
process-specific CPU time. Randomization was implemented using the gsl_rng_uniform()
function of the GSL library, which is superior to the standard rand() (see footnote 5). Sampling time,
which is the same for all algorithms, was not included in CPU time.

Figure 3 shows the performance loss in terms of elapsed CPU time for the three problems
and algorithms. The results are averages over 50 runs, where at the beginning of each
run (i) the action-value function and the action preferences are randomly initialized in the
interval $[-V_{\max}, V_{\max}]$, and (ii) a new set of samples is generated from $P(\cdot|x,a)$ for all
$(x,a) \in \mathcal{Z}$. Results correspond to the average error computed after a small fixed number of
iterations.

First, we see that DPP-RL converges very fast achieving near optimal performance after 
a few seconds. DPP-RL outperforms both QL and VI in all the three benchmarks. The 
minimum and maximum errors are attained for the linear MDP problem and the Grid 



5. http://www.gnu.org/s/gsl

[Figure 3 plots: three panels, one per benchmark; horizontal axes show cpu time (sec).]

Figure 3: A comparison between DPP-RL, Q-Learning and model-based VI. Each plot
compares the performance loss of the policy induced by the algorithms for a
different MDP averaged over 50 different runs (see the text for details).



world, respectively. We also observe that the difference between DPP-RL and QL is very 
significant, about two orders of magnitude, in both the linear MDP and the Combination 
lock problems. In the grid world DPP-RL's performance is more than 4 times better than 
that of QL. 

QL shows the best performance for $\omega = 0.51$. The quality of the QL solution degrades
as a function of $\omega$. Concerning VI, its error shows a sudden decrease on the first error
caused by the model estimation.

The standard deviations of the performance loss give an indication of how robust the
solutions obtained by the algorithms are. Table 1 shows the final numerical outcomes of DPP-RL,
QL and VI (standard deviations in parentheses). We can see that the variance
of estimation of DPP-RL is substantially smaller than those of QL and VI.

Table 1: A comparison between DPP-RL, Q-learning (QL) and the model-based value iteration
(VI) given a fixed computational and sampling budget. Table 1 shows error
means and standard deviations (in parentheses) at the end of the simulations
for three different algorithms (rows) and three different benchmarks (columns).

Benchmark               Linear MDP         Combination lock   Grid world
Run time                30 sec.            30 sec.            60 sec.
DPP-RL                  0.05 (0.02)        0.20 (0.09)        0.32 (0.03)
VI                      16.60 (11.60)      69.33 (15.38)      5.67 (1.73)
QL, $\omega = 0.51$     4.08 (3.21)        18.18 (4.36)       1.46 (0.12)
QL, $\omega = 0.75$     31.41 (12.77)      176.13 (25.68)     17.21 (7.31)
QL, $\omega = 1.00$     138.01 (146.28)    195.74 (5.73)      25.92 (20.13)

These results show that, as suggested in Theorems 6 and 4, DPP-RL manages to average
out the simulation noise caused by sampling and converges rapidly to a near-optimal
solution, which is very robust. In addition, we can conclude that DPP-RL performs significantly
better than QL and VI in the three presented benchmarks for our choice of experimental
setup.



5.2 SADPP 

In this subsection, we illustrate the performance of the SADPP algorithm in the presence
of function approximation and a limited sampling budget per iteration. We compare SADPP
with a modification of regularized fitted Q-iteration (RFQI) (Farahmand et al., 2008) which
makes use of a fixed number of basis functions. RFQI can be regarded as a Monte-Carlo
sampling implementation of approximate value iteration with action-state representation.
We compare SADPP with RFQI since both methods make use of $\ell_2$-regularization. The
purpose of this subsection is to analyze numerically the sample complexity of SADPP, i.e., the
number of samples required to achieve a near-optimal performance with a low variance.
The benchmark we consider is a variant of the optimal replacement problem presented in
Munos and Szepesvári (2008). In the following subsection we describe the problem and
subsequently we present the results.



5.2.1 Optimal replacement problem 

This problem is an infinite-horizon, discounted MDP. The state measures the accumulated
use of a certain product and is represented as a continuous, one-dimensional variable. At
each time-step $t$, either the product is kept, $a(t) = 0$, or replaced, $a(t) = 1$. Whenever
the product is replaced by a new one, the state variable is reset to zero, $x(t) = 0$, at an
additional cost $C$. The new state is chosen according to an exponential distribution, with
possible values starting from zero or from the current state value, depending on the latest
action:

$$p(y|x,a=0) = \begin{cases} \beta e^{-\beta(y-x)} & \text{if } y \geq x, \\ 0 & \text{if } y < x, \end{cases} \qquad p(y|x,a=1) = \begin{cases} \beta e^{-\beta y} & \text{if } y \geq 0, \\ 0 & \text{if } y < 0. \end{cases}$$

The reward function is a monotonically decreasing function of the state $x$ if the product
is kept, $r(x,0) = -c(x)$, and constant if the product is replaced, $r(x,1) = -C - c(0)$.
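
The dynamics above can be simulated in a few lines. The following is a minimal sketch of one transition of the replacement MDP; the function name, the argument names and the numerical values in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import random


def replacement_step(x, a, beta, cost, replace_cost, rng):
    """Sample (next_state, reward) for the optimal replacement MDP.

    a = 0 keeps the product (the next state starts from x);
    a = 1 replaces it (the next state starts from zero).
    """
    start = x if a == 0 else 0.0
    # The increment y - start is exponentially distributed with rate beta.
    y = start + rng.expovariate(beta)
    reward = -cost(x) if a == 0 else -replace_cost - cost(0.0)
    return y, reward


rng = random.Random(0)
linear_cost = lambda x: 4.0 * x  # illustrative increasing cost c(x)
y_keep, r_keep = replacement_step(2.0, 0, 0.5, linear_cost, 30.0, rng)
y_new, r_new = replacement_step(2.0, 1, 0.5, linear_cost, 30.0, rng)
```

Keeping the product can only increase the accumulated use, while replacing it restarts the state from zero at the fixed price $C$.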

The optimal action is to keep as long as the accumulated use is below a threshold or to
replace otherwise:

$$a^*(x) = \begin{cases} 0 & \text{if } x \in [0,\bar{x}], \\ 1 & \text{if } x > \bar{x}. \end{cases}$$

Following Munos and Szepesvári (2008), $\bar{x}$ can be obtained exactly via the Bellman
equation and is the unique solution to

$$C = \int_0^{\bar{x}} \frac{c'(y)}{1-\gamma}\left(1 - \gamma e^{-\beta(1-\gamma)y}\right) dy.$$

5.2.2 Experimental setup and results

For both SADPP and RFQI we map the state-action space using 20 radial basis functions
(10 for the continuous one-dimensional state variable $x$, spanning the state space $\mathcal{X}$, and
2 for the two possible actions). Other parameter values were chosen to be the same as in
Munos and Szepesvári (2008), that is, $\gamma = 0.6$, $\beta = 0.5$, $C = 30$ and $c(x) = 4x$, which
results in $\bar{x} \simeq 4.8665$. We also fix an upper bound for the states, $x_{\max} = 10$, and modify
the problem definition such that if the next state $y$ happens to be outside of the domain
$[0, x_{\max}]$ then the product is replaced immediately, and a new state is drawn as if action
$a = 1$ were chosen in the previous time step.
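
The reported threshold can be checked numerically. The sketch below assumes the linear cost $c(x) = 4x$, so that $c'(y) = 4$ and the integral in the threshold condition has a closed form, and solves the condition by bisection on $[0, x_{\max}]$. The function names and the choice of bisection are assumptions made for this example.

```python
import math


def threshold_residual(x_bar, gamma, beta, replace_cost, slope):
    """Integral side minus C in the threshold condition, for c(x) = slope * x.

    With c'(y) = slope the integral equals
    slope / (1 - gamma) * (x_bar - gamma * (1 - exp(-kappa * x_bar)) / kappa),
    where kappa = beta * (1 - gamma).
    """
    kappa = beta * (1.0 - gamma)
    integral = slope / (1.0 - gamma) * (
        x_bar - gamma * (1.0 - math.exp(-kappa * x_bar)) / kappa
    )
    return integral - replace_cost


def solve_threshold(gamma, beta, replace_cost, slope, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection on [lo, hi]; the integral is increasing in x_bar."""
    if threshold_residual(lo, gamma, beta, replace_cost, slope) > 0:
        raise ValueError("root is not bracketed from below")
    if threshold_residual(hi, gamma, beta, replace_cost, slope) < 0:
        raise ValueError("root is not bracketed from above")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if threshold_residual(mid, gamma, beta, replace_cost, slope) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_bar = solve_threshold(gamma=0.6, beta=0.5, replace_cost=30.0, slope=4.0)
```

With these parameter values the condition reduces to $\bar{x} + 3e^{-0.2\bar{x}} = 6$, whose root agrees with the value of about 4.8665 quoted in the text.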

To compare both algorithms we discretize the state space in $K = 100$ bins and use the
following error measure:

$$\text{error} = \frac{\sum_{k=1}^{K} |a^*(x_k) - \hat{a}(x_k)|}{K}, \tag{26}$$

where $\hat{a}$ is the action selected by the algorithm. Note that, unlike RFQI, which selects
the action by choosing the action with the highest action-value function, SADPP induces a
stochastic policy, that is, a distribution over actions. We select $\hat{a}$ for SADPP by choosing
the most probable action from the induced soft-max policy, and then use this to compute
(26). Both algorithms were implemented in MATLAB and executed under the same hardware
specifications of the previous section.

We analyze the effect of using different numbers of samples per iteration, $N \in
\{50, 150, 500\}$ (see footnote 6). The results are averages over 200 runs, where at the beginning of each
run the vector Q is initialized in the interval $[-1,1]$ for both algorithms. The rest of the
parameters, including the regularization factor $\alpha$ and $\eta$, were optimized for the best
asymptotic performance for each $N$ independently.

Figure 4 shows averages and standard deviations of the errors. First, we observe that for
large $N$, after an initial transient, both SADPP and RFQI reach a near-optimal solution.
We observe that SADPP asymptotically outperforms RFQI on average in all cases. The
average error and the variance of estimation of the resulting solutions decrease with $N$
in both approaches. A comparison of the variances after the transient suggests that the
sample complexity of SADPP is significantly smaller than that of RFQI. Remarkably, the variance
of SADPP using $N = 50$ samples is comparable to the one provided by RFQI using $N = 500$
samples. Further, the variance of SADPP is reduced faster with increasing $N$. These results
allow us to conclude that SADPP can have positive effects in reducing the effect of simulation
noise, as postulated in Section 4.

6. Related Work 

There are other methods which rely on an incremental update of the policy. One well-known
algorithm of this kind is the actor-critic method (AC), in which the actor uses the
value function computed by the critic to guide the policy search (Sutton and Barto, 1998,
chap. 6.6). An important extension of AC, the policy-gradient actor critic (PGAC), extends
the idea of AC to problems of practical scale (Sutton et al., 1999; Peters and Schaal,



6. For both algorithms a new independent set of samples is generated at each iteration.

[Figure 4 plots: horizontal axes show iterations.]

Figure 4: Numerical results for the optimal replacement problem. Each plot shows a comparison
of the error between the SADPP and RFQI algorithms using a certain number
of samples $N$. Error is defined as in equation (26). Areas indicate averages
plus/minus standard deviations of the error starting from 200 uniformly
distributed initial conditions (see the text for details).



2008). In PGAC, the actor updates the parameterized policy in the direction of the
(natural) gradient of performance, provided by the critic. The gradient update ensures
that PGAC asymptotically converges to a local maximum, given that an unbiased estimate
of the gradient is provided by the critic (Maei et al., 2010; Bhatnagar et al., 2009;
Konda and Tsitsiklis, 2003; Kakade, 2001). Other incremental RL methods include Q-learning
(Watkins and Dayan, 1992) and SARSA (Singh et al., 2000), which can be considered
as the incremental variants of the value iteration and the optimistic policy iteration
algorithms, respectively (Bertsekas and Tsitsiklis, 1996). These algorithms have been shown
to converge to the optimal value function in the tabular case (Bertsekas and Tsitsiklis, 1996;
Jaakkola et al., 1994). Also, there are some studies in the literature concerning the asymptotic
convergence of Q-learning in the presence of function approximation (Melo et al., 2008;
Szepesvári and Smart, 2004). However, to the best of our knowledge, there is no precedent
in the literature for asymptotic or finite-iteration performance-loss bounds of incremental
RL methods, and this study appears to be the first to prove such a bound for an incremental
RL algorithm.

The work proposed in this paper has some relation to recent work by Kappen (2005)
and Todorov (2006), who formulate a stochastic optimal control problem to find a conditional
probability distribution $p(x'|x)$ given an uncontrolled dynamics $\bar{p}(x'|x)$. The control
cost is the relative entropy between $p(x'|x)$ and $\bar{p}(x'|x)\exp(r(x))$. The difference is that
in their work a restricted class of control problems is considered for which the optimal
solution $p$ can be computed directly in terms of $\bar{p}$ without requiring Bellman-like iterations.
Instead, the present approach is more general, but does require Bellman-like iterations.
Likewise, our formalism is superficially similar to PoWER (Kober and Peters, 2008) and
SAEM (Vlassis and Toussaint, 2009), which rely on the EM algorithm to maximize a lower
bound for the expected return in an iterative fashion. This lower bound can also be written
as a KL-divergence between two distributions. Another relevant study is relative entropy
policy search (REPS) (Peters et al., 2010), which relies on the idea of minimizing the relative
entropy to control the size of the policy update. The main differences are: (i) the REPS
algorithm is an actor-critic type of algorithm, while DPP is more a policy iteration type of
method, (ii) in REPS the inverse temperature $\eta$ needs to be optimized while DPP converges
to the optimal solution for any inverse temperature $\eta$, and (iii) here we provide a
convergence analysis of DPP, while there is no convergence analysis in REPS.



7. Discussion and Future Works 

We have presented a new approach, dynamic policy programming (DPP), to compute the
optimal policy in infinite-horizon discounted-reward MDPs. We have theoretically proven
the convergence of DPP to the optimal policy for the tabular case. We have also provided
performance-loss bounds for DPP in the presence of approximation. The bounds have been
expressed in terms of the supremum norm of the average accumulated error, as opposed to the
standard results for AVI and API, which are expressed in terms of the supremum norm of the errors. We
have then introduced a new incremental model-free RL algorithm, called DPP-RL, which
relies on a sample estimate of the DPP update rule to estimate the optimal policy.
We have proven the asymptotic convergence of DPP-RL to the optimal policy and then
compared it numerically with the standard RL methods. Experimental results on
various MDPs have been provided showing that, in all cases, DPP-RL is superior to other
RL methods in terms of convergence rate. This may be due to the fact that DPP-RL, unlike
other incremental RL methods, does not rely on stochastic approximation for estimating
the optimal policy and therefore it does not suffer from the slow convergence caused by the
presence of the decaying learning step in stochastic approximation.

In this work, we are only interested in the estimation of the optimal policy and not the
problem of exploration. Therefore, we have not compared our algorithms to the PAC-MDP
methods (Strehl et al., 2009), in which the choice of the exploration policy impacts the
behavior of the learning algorithm. Also, in this paper, we have not compared our results
with those of (PG)AC since they rely on a different kind of sampling strategy: both DPP-RL
and SADPP rely on a generative model for sampling, whereas AC makes use of some
trajectories of the state-action pairs, generated by Monte-Carlo simulation, to estimate the
optimal policy.

In this study, we provide $\ell_\infty$-norm performance-loss bounds for approximate DPP. However,
most supervised learning and regression algorithms rely on minimizing some form of
$\ell_p$-norm error. Therefore, it is natural to search for a kind of performance bound that
relies on the $\ell_p$-norm of the approximation error. Following Munos (2005), $\ell_p$-norm bounds for
approximate DPP can be established by providing a bound on the performance loss of each
component of the value function under the policy induced by DPP. This would be a topic for
future research.

Another direction for future work is to provide finite-sample probably approximately
correct (PAC) bounds for SADPP and DPP-RL in the spirit of previous theoretical results
available for fitted value iteration and fitted Q-iteration (Munos and Szepesvári, 2008;
Antos et al., 2008). In the case of SADPP, this would require extending the error propagation
result of Theorem 4 to an $\ell_2$-norm analysis and combining it with the standard
regression bounds.

Finally, an important extension of our results would be to apply DPP to large-scale
action problems. In that case, we need an efficient way to approximate $\mathcal{M}_\eta\Psi_k(x)$ in the update
rule (18), since computing the exact summations becomes expensive. One idea is to sample
estimate $\mathcal{M}_\eta\Psi_k(x)$ using Monte-Carlo simulation (MacKay, 2003, chap. 29), since $\mathcal{M}_\eta\Psi_k(x)$
is the expected value of $\Psi_k(x,a)$ under the soft-max policy $\pi_k$.
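
A minimal sketch of this idea, under stated assumptions, is given below: actions are drawn from the soft-max policy and the sampled action preferences are averaged. For simplicity the sketch normalizes the policy exactly before sampling, which already costs one pass over all actions; in a truly large action space one would presumably replace this step by an approximate sampler, which is not shown here. All names and values are hypothetical.

```python
import math
import random


def softmax_policy(prefs, eta):
    """Soft-max distribution over actions for one state."""
    top = max(prefs)
    weights = [math.exp(eta * (p - top)) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def mc_softmax_value(prefs, eta, n_samples, rng):
    """Estimate M_eta Psi(x) = E_{a ~ pi(.|x)}[Psi(x, a)] by sampling actions."""
    probs = softmax_policy(prefs, eta)
    draws = rng.choices(range(len(prefs)), weights=probs, k=n_samples)
    return sum(prefs[a] for a in draws) / n_samples


rng = random.Random(0)
prefs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # made-up preferences
exact = sum(p * v for p, v in zip(softmax_policy(prefs, 2.0), prefs))
estimate = mc_softmax_value(prefs, eta=2.0, n_samples=20000, rng=rng)
```

The estimator is unbiased, and its standard error decays as the inverse square root of the number of sampled actions.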

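Appendix A. Proof of Proposition 1

We first introduce the Lagrangian function $L(x;\lambda_x) : \mathcal{X} \to \mathbb{R}$:

$$L(x;\lambda_x) = \sum_{a\in\mathcal{A}}\pi(a|x)\left[r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)\right] - \frac{1}{\eta}\mathrm{KL}\left(\pi(\cdot|x)\,\|\,\bar\pi(\cdot|x)\right) - \lambda_x\left[\sum_{a\in\mathcal{A}}\pi(a|x) - 1\right].$$

The maximization over $\pi$ can be expressed as maximizing the Lagrangian function
$L(x;\lambda_x)$. The necessary condition for the extremum with respect to $\pi(\cdot|x)$ is:

$$0 = \frac{\partial L(x;\lambda_x)}{\partial\pi(a|x)} = r(x,a) + \gamma(PV^*_{\bar\pi})(x,a) - \frac{1}{\eta} - \frac{1}{\eta}\log\left(\frac{\pi(a|x)}{\bar\pi(a|x)}\right) - \lambda_x,$$

which leads to:

$$\bar\pi^*(a|x) = \bar\pi(a|x)\exp(-\eta\lambda_x - 1)\exp\left[\eta\left(r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)\right)\right], \quad \forall x \in \mathcal{X}. \tag{27}$$

The Lagrange multipliers can then be solved from the constraints:

$$1 = \sum_{a\in\mathcal{A}}\bar\pi^*(a|x) = \exp(-\eta\lambda_x - 1)\sum_{a\in\mathcal{A}}\bar\pi(a|x)\exp\left[\eta\left(r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)\right)\right],$$

$$\lambda_x = \frac{1}{\eta}\log\sum_{a\in\mathcal{A}}\bar\pi(a|x)\exp\left[\eta\left(r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)\right)\right] - \frac{1}{\eta}. \tag{28}$$

By plugging (28) into (27) we deduce:

$$\bar\pi^*(a|x) = \frac{\bar\pi(a|x)\exp\left[\eta\left(r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)\right)\right]}{\sum_{a'\in\mathcal{A}}\bar\pi(a'|x)\exp\left[\eta\left(r(x,a') + \gamma(PV^*_{\bar\pi})(x,a')\right)\right]}, \quad \forall (x,a) \in \mathcal{Z}. \tag{29}$$

The result then follows by substituting (29) back into the maximization.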
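
The closed form derived in this appendix can be sanity-checked numerically for a single state. In the sketch below the vector `q` stands in for $r(x,a) + \gamma(PV^*_{\bar\pi})(x,a)$ and is drawn at random, as is the baseline policy; both are assumptions of the example. The sketch evaluates the KL-regularized objective at the closed-form policy (29) and compares it with the log-sum-exp value and with the objective at many random policies.

```python
import math
import random


def kl_regularized_objective(pi, pi_bar, q, eta):
    """sum_a pi(a) q(a) - (1/eta) KL(pi || pi_bar) for one fixed state."""
    value = sum(p * qa for p, qa in zip(pi, q))
    kl = sum(p * math.log(p / pb) for p, pb in zip(pi, pi_bar) if p > 0.0)
    return value - kl / eta


def closed_form_solution(pi_bar, q, eta):
    """Policy of the form (29) and the value (1/eta) log sum_a pi_bar(a) exp(eta q(a))."""
    weights = [pb * math.exp(eta * qa) for pb, qa in zip(pi_bar, q)]
    total = sum(weights)
    return [w / total for w in weights], math.log(total) / eta


def random_distribution(n, rng):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]


rng = random.Random(1)
eta = 3.0
q = [rng.uniform(-1.0, 1.0) for _ in range(5)]  # stands in for r + gamma * P V
pi_bar = random_distribution(5, rng)
pi_star, v_star = closed_form_solution(pi_bar, q, eta)
at_optimum = kl_regularized_objective(pi_star, pi_bar, q, eta)
best_random = max(
    kl_regularized_objective(random_distribution(5, rng), pi_bar, q, eta)
    for _ in range(2000)
)
```

The objective at the closed-form policy matches the log-sum-exp value, and none of the randomly drawn policies exceeds it, which is what the Lagrangian argument predicts.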
Appendix B. Proof of Theorem 1

In this section, we provide a formal analysis of the convergence behavior of DPP. Our
objective is to establish a rate of convergence for the value function of the policy induced
by DPP.

Our main result is in the form of the following finite-iteration performance-loss bound, for
all $k \geq 0$:

$$\|Q^* - Q^{\pi_k}\| \leq \frac{2\gamma\left(4V_{\max} + \frac{\log(L)}{\eta}\right)}{(1-\gamma)^2(k+1)}. \tag{30}$$

Here, $Q^{\pi_k}$ is the action-value function under the policy $\pi_k$, and $\pi_k$ is the policy induced by DPP
at step $k$.

To derive (30) one needs to relate $Q^{\pi_k}$ to the optimal $Q^*$. Unfortunately, finding a
direct relation between $Q^{\pi_k}$ and $Q^*$ is not an easy task. Instead, we relate $Q^{\pi_k}$ to $Q^*$ via an
auxiliary action-value function $Q_k$, which we define below. In the remainder of this section
we take the following steps: (i) we express $\Psi_k$ in terms of $Q_k$ in Lemma 7, (ii) we obtain
an upper bound on the normed error $\|Q^* - Q_k\|$ in Lemma 8. Finally, (iii) we use these
two results to derive a bound on the normed error $\|Q^* - Q^{\pi_k}\|$. For the sake of readability,
we skip the formal proofs of the lemmas in this section since we prove a more general
case in Section D. Further, in the sequel, we suppress the state(-action) dependencies in our
notation wherever these dependencies are clear, e.g., $\Psi(x,a)$ becomes $\Psi$ and $Q(x,a)$ becomes
$Q$.

Now let us define the auxiliary action-value function $Q_k$. The sequence of auxiliary
action-value functions $\{Q_0, Q_1, Q_2, \dots\}$ is obtained by iterating the initial $Q_0 = \Psi_0$ from
the following recursion:

$$Q_k = \frac{k-1}{k}\mathcal{T}^{\pi_{k-1}}Q_{k-1} + \frac{1}{k}\mathcal{T}^{\pi_{k-1}}Q_0, \tag{31}$$

where $\pi_k$ is the policy induced by the $k$-th iterate of DPP.

Lemma 7 relates $\Psi_k$ with $Q_k$:

Lemma 7 Let $k$ be a positive integer. Then, we have:

$$\Psi_k = kQ_k + Q_0 - \pi_{k-1}\left((k-1)Q_{k-1} + Q_0\right). \tag{32}$$
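
The identity (32) can be checked numerically on a small random MDP. The sketch below assumes the exact DPP update $\Psi_{k+1} = \Psi_k + \mathcal{T}^{\pi_k}\Psi_k - \pi_k\Psi_k$ (the approximate update rule of Appendix E with zero error), with $\pi_k$ the soft-max policy of $\Psi_k$, and runs it alongside the recursion (31). The MDP, its size and all function names are made up for the illustration.

```python
import math
import random


def softmax_policy(prefs, eta):
    """Soft-max distribution over actions for one state."""
    top = max(prefs)
    weights = [math.exp(eta * (p - top)) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def apply_policy(pi, q):
    """(pi q)(x) = sum_a pi(a|x) q(x, a)."""
    return [sum(p * v for p, v in zip(pi[x], q[x])) for x in range(len(q))]


def bellman(pi, q, r, P, gamma):
    """(T^pi q)(x, a) = r(x, a) + gamma * sum_y P(y|x, a) (pi q)(y)."""
    v = apply_policy(pi, q)
    n = len(q)
    return [[r[x][a] + gamma * sum(P[x][a][y] * v[y] for y in range(n))
             for a in range(len(q[x]))] for x in range(n)]


def lemma7_max_gap(n_states=4, n_actions=3, gamma=0.9, eta=1.5, steps=6, seed=0):
    """Largest entry of |Psi_k - right-hand side of (32)| over k = 1..steps."""
    rng = random.Random(seed)
    states, actions = range(n_states), range(n_actions)
    r = [[rng.uniform(-1.0, 1.0) for _ in actions] for _ in states]
    P = []
    for _ in states:
        rows = []
        for _ in actions:
            raw = [rng.random() + 1e-3 for _ in states]
            rows.append([v / sum(raw) for v in raw])
        P.append(rows)
    psi = [[rng.uniform(-1.0, 1.0) for _ in actions] for _ in states]
    q0 = [row[:] for row in psi]   # Q_0 = Psi_0
    q_prev, worst = q0, 0.0        # q_prev holds Q_{k-1}
    for k in range(1, steps + 1):
        pi = [softmax_policy(row, eta) for row in psi]  # pi_{k-1}
        # Exact DPP step: Psi_k = Psi_{k-1} + T^{pi} Psi_{k-1} - pi Psi_{k-1}.
        t_psi, v_psi = bellman(pi, psi, r, P, gamma), apply_policy(pi, psi)
        psi = [[psi[x][a] + t_psi[x][a] - v_psi[x] for a in actions] for x in states]
        # Recursion (31): Q_k = ((k-1) T^{pi} Q_{k-1} + T^{pi} Q_0) / k.
        t_q, t_q0 = bellman(pi, q_prev, r, P, gamma), bellman(pi, q0, r, P, gamma)
        q_k = [[((k - 1) * t_q[x][a] + t_q0[x][a]) / k for a in actions] for x in states]
        # Right-hand side of (32): k Q_k + Q_0 - pi_{k-1}((k-1) Q_{k-1} + Q_0).
        mix = [[(k - 1) * q_prev[x][a] + q0[x][a] for a in actions] for x in states]
        base = apply_policy(pi, mix)
        gap = max(abs(psi[x][a] - (k * q_k[x][a] + q0[x][a] - base[x]))
                  for x in states for a in actions)
        worst = max(worst, gap)
        q_prev = q_k
    return worst


gap = lemma7_max_gap()
```

Up to floating-point rounding, the action preferences produced by the update coincide with the right-hand side of (32) at every iteration of this example.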

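Now we focus on relating $Q_k$ and $Q^*$:

Lemma 8 Let Assumption 1 hold, let $L$ denote the cardinality of $\mathcal{A}$ and let $k$ be a positive
integer. Also assume that $\|\Psi_0\| \leq V_{\max}$. Then the following inequality holds:

$$\|Q^* - Q_k\| \leq \frac{\gamma\left(4V_{\max} + \frac{\log(L)}{\eta}\right)}{(1-\gamma)k}. \tag{33}$$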

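Lemma 8 provides an upper bound on the normed error $\|Q_k - Q^*\|$. We make use of
Lemma 8 to prove the main result of this section:

$$\begin{aligned}
\|Q^* - Q^{\pi_k}\| &= \|Q^* - Q_{k+1} + Q_{k+1} - \mathcal{T}^{\pi_k}Q^* + \mathcal{T}^{\pi_k}Q^* - Q^{\pi_k}\| \\
&\leq \|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\| + \|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q^{\pi_k}\| \\
&\leq \|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\| + \gamma\|Q^* - Q^{\pi_k}\|.
\end{aligned}$$

By collecting terms we obtain:

$$\begin{aligned}
\|Q^* - Q^{\pi_k}\| &\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\|\right) \\
&= \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \left\|\frac{k}{k+1}\mathcal{T}^{\pi_k}Q_k + \frac{1}{k+1}\mathcal{T}^{\pi_k}Q_0 - \mathcal{T}^{\pi_k}Q^*\right\|\right) \\
&\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \frac{k}{k+1}\|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q_k\| + \frac{1}{k+1}\|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q_0\|\right) \\
&\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \frac{\gamma k}{k+1}\|Q^* - Q_k\| + \frac{\gamma}{k+1}\|Q^* - Q_0\|\right).
\end{aligned}$$

This combined with Lemma 8 and the bound $\|Q^* - Q_0\| \leq 2V_{\max}$ completes the proof.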
Appendix C. Proof of Theorem 3

First, we note that $Q_k$ converges to $Q^*$ (Lemma 8) and subsequently $\pi_k$ converges to $\pi^*$
by (37). Therefore, there exists a limit for $\Psi_k$ since $\Psi_k$ is written in terms of $Q_k$, $Q_0$ and $\pi_{k-1}$
(Lemma 7).

Now, we compute the limit of $\Psi_k$. $Q_k$ converges to $Q^*$ with a linear rate from Lemma 8.
Also, we have $V^* = \pi^*Q^*$ by definition of $V^*$ and $Q^*$. Then, by taking the limit of (32) we
deduce:

$$\begin{aligned}
\lim_{k\to\infty}\Psi_k(x,a) &= \lim_{k\to\infty}\left[kQ^*(x,a) + Q_0(x,a) - (k-1)V^*(x) - (\pi^*Q_0)(x)\right] \\
&= \lim_{k\to\infty}k\left(Q^*(x,a) - V^*(x)\right) + Q_0(x,a) - (\pi^*Q_0)(x) + V^*(x).
\end{aligned}$$

This combined with Assumption 2 completes the proof.



Appendix D. Proof of Theorem 4

This section provides a formal theoretical analysis of the performance of dynamic policy
programming in the presence of approximation.

Consider a sequence of the action preferences $\{\Psi_0, \Psi_1, \Psi_2, \dots\}$ as the iterates of (18).
Our goal is to establish an $\ell_\infty$-norm performance-loss bound of the policy induced by
approximate DPP. The main result is that at iteration $k \geq 0$ of approximate DPP, we have:

$$\|Q^* - Q^{\pi_k}\| \leq \frac{2\gamma\left(4V_{\max} + \frac{\log(L)}{\eta}\right)}{(1-\gamma)^2(k+1)} + \frac{2}{(1-\gamma)(k+1)}\sum_{j=1}^{k+1}\gamma^{k-j+1}\|E_{j-1}\|, \tag{34}$$

where $E_k = \sum_{j=0}^{k}\varepsilon_j$ is the cumulative approximation error up to step $k$. Here, $Q^{\pi_k}$ denotes
the action-value function of the policy $\pi_k$ and $\pi_k$ is the soft-max policy associated with
$\Psi_k$.

As in the proof of Theorem 1, we relate $Q^*$ with $Q^{\pi_k}$ via an auxiliary action-value
function $Q_k$. In the sequel, we first express $\Psi_k$ in terms of $Q_k$ in Lemma 9. Then, we
obtain an upper bound on the normed error $\|Q^* - Q_k\|$ in Lemma 10. Finally, we use these
two results to derive (34).

Now, let us define the auxiliary action-value function $Q_k$. The sequence of auxiliary
action-value functions $\{Q_0, Q_1, Q_2, \dots\}$ is obtained by iterating the initial action-value
function $Q_0 = \Psi_0$ from the following recursion:

$$Q_k = \frac{k-1}{k}\mathcal{T}^{\pi_{k-1}}Q_{k-1} + \frac{1}{k}\left(\mathcal{T}^{\pi_{k-1}}Q_0 + E_{k-1}\right), \tag{35}$$

where (35) may be considered as an approximate version of (31). Lemma 9 relates $\Psi_k$ with
$Q_k$.

Lemma 9 Let $k$ be a positive integer and let $\pi_k$ denote the policy induced by the approximate
DPP at iteration $k$. Then we have:

$$\Psi_k = kQ_k + Q_0 - \pi_{k-1}\left((k-1)Q_{k-1} + Q_0\right). \tag{36}$$

Proof We rely on induction for the proof of this lemma. The result holds for $k = 1$
since one can easily show that (36) reduces to (18). We then show that if (36) holds for $k$
then it also holds for $k+1$. From (18) we have:

$$\begin{aligned}
\Psi_{k+1} &= kQ_k + Q_0 - \pi_{k-1}((k-1)Q_{k-1} + Q_0) + \mathcal{T}^{\pi_k}\left(kQ_k + Q_0 - \pi_{k-1}((k-1)Q_{k-1} + Q_0)\right) \\
&\quad - \pi_k\left(kQ_k + Q_0 - \pi_{k-1}((k-1)Q_{k-1} + Q_0)\right) + \varepsilon_k \\
&= kQ_k + Q_0 + \mathcal{T}^{\pi_k}\left(kQ_k + Q_0 - \pi_{k-1}((k-1)Q_{k-1} + Q_0)\right) - \pi_k(kQ_k + Q_0) \\
&\quad + E_k - E_{k-1},
\end{aligned}$$

where in the last step we make use of the following:

$$\pi_k\pi_{k-1}(\cdot) = \pi_{k-1}(\cdot), \qquad \mathcal{T}^{\pi_k}\pi_{k-1}(\cdot) = \mathcal{T}^{\pi_{k-1}}(\cdot).$$

By collecting terms we deduce:

$$\begin{aligned}
\Psi_{k+1} &= kQ_k - (k-1)\mathcal{T}^{\pi_{k-1}}Q_{k-1} - \mathcal{T}^{\pi_{k-1}}Q_0 - E_{k-1} + k\mathcal{T}^{\pi_k}Q_k + \mathcal{T}^{\pi_k}Q_0 + E_k \\
&\quad + Q_0 - \pi_k(kQ_k + Q_0) \\
&= (k+1)Q_{k+1} + Q_0 - \pi_k(kQ_k + Q_0), \quad \text{by (35).}
\end{aligned}$$

Thus (36) holds for $k+1$, and is thus true for all $k \geq 1$. ■

Based on Lemma 9 one can express the policy induced by DPP, $\pi_k$, in terms of $Q_k$ and $Q_0$:

$$\begin{aligned}
\pi_k(a|x) &= \frac{\exp\left(\eta\left(kQ_k(x,a) + Q_0(x,a) - \pi_{k-1}((k-1)Q_{k-1} + Q_0)(x)\right)\right)}{Z(x)} \\
&= \frac{\exp\left(\eta\left(kQ_k(x,a) + Q_0(x,a)\right)\right)}{Z'(x)},
\end{aligned} \tag{37}$$

where $Z'(x) = Z(x)\exp\left(\eta\pi_{k-1}((k-1)Q_{k-1} + Q_0)(x)\right)$ is the normalization factor.
Equation (37) expresses $\pi_k$ in terms of $Q_k$ and $Q_0$. In analogy to Lemma 8 we state the
following lemma that establishes a bound on $\|Q^* - Q_k\|$.

Lemma 10 ($\ell_\infty$-norm bound on $Q^* - Q_k$) Let Assumption 1 hold. Define $Q_k$ by (35).
Let $L$ denote the cardinality of $\mathcal{A}$ and $k$ be a non-negative integer. Also, assume that $\|\Psi_0\| \leq
V_{\max}$; then the following inequality holds:

$$\|Q^* - Q_k\| \leq \frac{\gamma\left(4V_{\max} + \frac{\log(L)}{\eta}\right)}{(1-\gamma)k} + \frac{1}{k}\sum_{j=1}^{k}\gamma^{k-j}\|E_{j-1}\|.$$

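We make use of the following results to prove Lemma 10.

Lemma 11 Let $\eta > 0$ and $\mathcal{Y}$ be a finite set with cardinality $L$. Also assume that $\mathcal{F}$ denotes
the space of measurable functions on $\mathcal{Y}$, with $\mathcal{Y}^f_{\max} \subset \mathcal{Y}$ the set of all entries of $\mathcal{Y}$ which maximize
$f$. Then the following inequality holds for all $f \in \mathcal{F}$:

$$\max_{y\in\mathcal{Y}}f(y) - \frac{\sum_{y\in\mathcal{Y}}\exp(\eta f(y))f(y)}{\sum_{y'\in\mathcal{Y}}\exp(\eta f(y'))} \leq \frac{\log(L)}{\eta}.$$

Lemma 12 Let $\eta > 0$ and $k$ be a positive integer. Assume $\|Q_0\| \leq V_{\max}$; then the following
holds:

$$\|k\mathcal{T}Q_k + \mathcal{T}Q_0 - k\mathcal{T}^{\pi_k}Q_k - \mathcal{T}^{\pi_k}Q_0\| \leq \gamma\left(2V_{\max} + \frac{\log(L)}{\eta}\right).$$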

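Proof (Proof of Lemma 10) We rely on induction for the proof of this lemma. Obviously
the result holds for $k = 0$. Then we need to show that if (33) holds for $k$ it also holds for
$k+1$:

$$\begin{aligned}
\|Q^* - Q_{k+1}\| &= \left\|\mathcal{T}Q^* - \frac{k}{k+1}\mathcal{T}^{\pi_k}Q_k - \frac{1}{k+1}\left(\mathcal{T}^{\pi_k}Q_0 + E_k\right)\right\| \\
&= \left\|\frac{1}{k+1}\left(\mathcal{T}Q^* - \mathcal{T}^{\pi_k}Q_0\right) + \frac{k}{k+1}\left(\mathcal{T}Q^* - \mathcal{T}^{\pi_k}Q_k\right) - \frac{1}{k+1}E_k\right\| \\
&\leq \frac{1}{k+1}\left\|\mathcal{T}Q^* - \mathcal{T}Q_0 + \mathcal{T}Q_0 - \mathcal{T}^{\pi_k}Q_0 + k\left(\mathcal{T}Q^* - \mathcal{T}Q_k + \mathcal{T}Q_k - \mathcal{T}^{\pi_k}Q_k\right)\right\| + \frac{1}{k+1}\|E_k\| \\
&\leq \frac{1}{k+1}\left[\|\mathcal{T}Q^* - \mathcal{T}Q_0\| + \|k\mathcal{T}Q_k + \mathcal{T}Q_0 - k\mathcal{T}^{\pi_k}Q_k - \mathcal{T}^{\pi_k}Q_0\|\right] \\
&\quad + \frac{k}{k+1}\|\mathcal{T}Q^* - \mathcal{T}Q_k\| + \frac{1}{k+1}\|E_k\| \\
&\leq \frac{1}{k+1}\left[\gamma\|Q^* - Q_0\| + \|k\mathcal{T}Q_k + \mathcal{T}Q_0 - k\mathcal{T}^{\pi_k}Q_k - \mathcal{T}^{\pi_k}Q_0\|\right] \\
&\quad + \frac{\gamma k}{k+1}\|Q^* - Q_k\| + \frac{1}{k+1}\|E_k\|.
\end{aligned} \tag{38}$$

Now based on Lemma 12 and by plugging (33) into (38) we have:

$$\begin{aligned}
\|Q^* - Q_{k+1}\| &\leq \frac{\gamma}{k+1}\left[4V_{\max} + \frac{\log L}{\eta}\right] + \frac{\gamma k}{k+1}\left[\frac{\gamma\left(4V_{\max} + \frac{\log L}{\eta}\right)}{k(1-\gamma)} + \frac{1}{k}\sum_{j=1}^{k}\gamma^{k-j}\|E_{j-1}\|\right] + \frac{1}{k+1}\|E_k\| \\
&= \frac{\gamma\left(4V_{\max} + \frac{\log L}{\eta}\right)}{(1-\gamma)(k+1)} + \frac{1}{k+1}\sum_{j=1}^{k+1}\gamma^{k-j+1}\|E_{j-1}\|.
\end{aligned}$$

The result then follows, for all $k \geq 0$, by induction. ■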

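Proof (Proof of Lemma 11) For any $f \in \mathcal{F}$ we have:

$$\max_{y\in\mathcal{Y}}f(y) - \frac{\sum_{y\in\mathcal{Y}}\exp(\eta f(y))f(y)}{\sum_{y'\in\mathcal{Y}}\exp(\eta f(y'))} = \frac{\sum_{y\in\mathcal{Y}}\exp(-\eta g(y))g(y)}{\sum_{y'\in\mathcal{Y}}\exp(-\eta g(y'))},$$

with $g(y) = \max_{y\in\mathcal{Y}}f(y) - f(y)$. According to MacKay (2003, chap. 31):

$$\frac{\sum_{y\in\mathcal{Y}}\exp(-\eta g(y))g(y)}{\sum_{y'\in\mathcal{Y}}\exp(-\eta g(y'))} = -\frac{1}{\eta}\log\sum_{y\in\mathcal{Y}}\exp(-\eta g(y)) + \frac{1}{\eta}H_p,$$

where $H_p$ is the entropy of the probability distribution $p$ defined by:

$$p(y) = \frac{\exp(-\eta g(y))}{\sum_{y'\in\mathcal{Y}}\exp(-\eta g(y'))}.$$

The following steps complete the proof.

$$\frac{\sum_{y\in\mathcal{Y}}\exp(-\eta g(y))g(y)}{\sum_{y'\in\mathcal{Y}}\exp(-\eta g(y'))} \leq -\frac{1}{\eta}\log\left[1 + \sum_{y\notin\mathcal{Y}^f_{\max}}\exp(-\eta g(y))\right] + \frac{1}{\eta}H_p \leq \frac{1}{\eta}H_p \leq \frac{\log(L)}{\eta}.$$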



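Proof (Proof of Lemma 12) We have, by definition of operator $\mathcal{T}$:

$$\begin{aligned}
\|k\mathcal{T}Q_k + \mathcal{T}Q_0 - k\mathcal{T}^{\pi_k}Q_k - \mathcal{T}^{\pi_k}Q_0\| &\leq \gamma\|kP\mathcal{M}Q_k + P\mathcal{M}Q_0 - kP^{\pi_k}Q_k - P^{\pi_k}Q_0\| \\
&= \gamma\|P(\mathcal{M}kQ_k + \mathcal{M}Q_0 - \pi_k(kQ_k + Q_0))\| \\
&\leq \gamma\|\mathcal{M}kQ_k + \mathcal{M}Q_0 - \pi_k(kQ_k + Q_0)\| \\
&\leq \gamma\|2\mathcal{M}Q_0 + \mathcal{M}(kQ_k + Q_0) - \pi_k(kQ_k + Q_0)\| \\
&\leq \gamma\left(2\|Q_0\| + \|\mathcal{M}(kQ_k + Q_0) - \mathcal{M}_\eta(kQ_k + Q_0)\|\right),
\end{aligned} \tag{39}$$

where in the last line we make use of Equation (37). The result then follows by comparing
(39) with Lemma 11. ■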



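Lemma 10 provides an upper bound on the normed error $\|Q^* - Q_k\|$. We make use of
this result to derive a bound on the performance loss $\|Q^* - Q^{\pi_k}\|$:

$$\begin{aligned}
\|Q^* - Q^{\pi_k}\| &= \|Q^* - Q_{k+1} + Q_{k+1} - \mathcal{T}^{\pi_k}Q^* + \mathcal{T}^{\pi_k}Q^* - Q^{\pi_k}\| \\
&\leq \|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\| + \|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q^{\pi_k}\| \\
&\leq \|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\| + \gamma\|Q^* - Q^{\pi_k}\|.
\end{aligned}$$

By collecting terms we obtain:

$$\begin{aligned}
\|Q^* - Q^{\pi_k}\| &\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \|Q_{k+1} - \mathcal{T}^{\pi_k}Q^*\|\right) \\
&= \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \left\|\frac{k}{k+1}\mathcal{T}^{\pi_k}Q_k + \frac{1}{k+1}\left(\mathcal{T}^{\pi_k}Q_0 + E_k\right) - \mathcal{T}^{\pi_k}Q^*\right\|\right) \\
&\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \frac{k}{k+1}\|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q_k\| + \frac{1}{k+1}\|\mathcal{T}^{\pi_k}Q^* - \mathcal{T}^{\pi_k}Q_0\|\right) + \frac{1}{(1-\gamma)(k+1)}\|E_k\| \\
&\leq \frac{1}{1-\gamma}\left(\|Q^* - Q_{k+1}\| + \frac{\gamma k}{k+1}\|Q^* - Q_k\| + \frac{\gamma}{k+1}\|Q^* - Q_0\|\right) + \frac{1}{(1-\gamma)(k+1)}\|E_k\|.
\end{aligned}$$

This combined with Lemma 10 completes the proof.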
Appendix E. Proof of Theorem 6

We begin the analysis by introducing some new notations. We define the estimation error
$\varepsilon_k$ associated with the $k$-th iterate of DPP-RL as the difference between the Bellman operator
$\mathcal{T}^{\pi_k}\Psi_k(x,a)$ and its sample estimate:

$$\varepsilon_k(x,a) = \mathcal{T}_k^{\pi_k}\Psi_k(x,a) - \mathcal{T}^{\pi_k}\Psi_k(x,a), \quad \forall (x,a) \in \mathcal{Z}.$$

The DPP-RL update rule can then be re-expressed in the form of the more general approximate
DPP update rule:

$$\Psi_{k+1}(x,a) = \Psi_k(x,a) + \mathcal{T}^{\pi_k}\Psi_k(x,a) - \pi_k\Psi_k(x) + \varepsilon_k(x,a).$$

Now let us define $\mathcal{F}_k$ as the filtration generated by the sequence of all random variables $\{y_1, y_2, y_3, \dots, y_k\}$ drawn from the distribution $P(\cdot|x,a)$ for all $(x,a) \in \mathcal{Z}$. We have the property that $\mathbb{E}(\epsilon_k(x,a)|\mathcal{F}_{k-1}) = 0$, which means that for all $(x,a) \in \mathcal{Z}$ the sequence of estimation errors $\{\epsilon_1, \epsilon_2, \dots, \epsilon_k\}$ is a martingale difference sequence w.r.t. the filtration $\mathcal{F}_k$. The asymptotic convergence of DPP-RL to the optimal policy follows by extending the result of (21) to the case of bounded martingale differences. For that we need to show that the sequence of estimation errors $\{\epsilon_j\}_{j=0:k}$ is uniformly bounded:

Lemma 13 (Stability of DPP-RL) Let Assumption 1 hold and assume that the initial action-preference function is uniformly bounded by $V_{\max}$; then we have, for all $k \ge 0$,

$$\|\mathcal{T}_k^{\pi_k}\Psi_k\| \le \frac{2\gamma \log L}{\eta(1-\gamma)} + V_{\max}, \qquad \|\epsilon_k\| \le \frac{4\gamma \log L}{\eta(1-\gamma)} + 2V_{\max}.$$

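Proof We first prove that $\|\mathcal{T}_k^{\pi_k}\Psi_k\| \le 2\gamma\log L/(\eta(1-\gamma)) + V_{\max}$ by induction. Let us assume that the bound $\|\mathcal{T}_k^{\pi_k}\Psi_k\| \le 2\gamma\log L/(\eta(1-\gamma)) + V_{\max}$ holds. Thus

$$
\begin{aligned}
\|\mathcal{T}_{k+1}^{\pi_{k+1}}\Psi_{k+1}\| &\le \|r\| + \gamma\|\mathcal{M}_\eta\Psi_{k+1}\| \\
&= \|r\| + \gamma\|\mathcal{M}_\eta(\Psi_k + \mathcal{T}_k^{\pi_k}\Psi_k - \mathcal{M}_\eta\Psi_k)\| \\
&\le \|r\| + \gamma\|\mathcal{M}_\eta(\Psi_k + \mathcal{T}_k^{\pi_k}\Psi_k - \mathcal{M}_\eta\Psi_k) - \mathcal{M}(\Psi_k + \mathcal{T}_k^{\pi_k}\Psi_k - \mathcal{M}_\eta\Psi_k)\| \\
&\quad + \gamma\|\mathcal{M}(\Psi_k + \mathcal{T}_k^{\pi_k}\Psi_k - \mathcal{M}_\eta\Psi_k)\| \\
&\le \|r\| + \frac{\gamma\log L}{\eta} + \gamma\|\mathcal{M}(\Psi_k + \mathcal{T}_k^{\pi_k}\Psi_k - \mathcal{M}_\eta\Psi_k + \mathcal{M}\Psi_k - \mathcal{M}\Psi_k)\| \\
&\le \|r\| + \frac{\gamma\log L}{\eta} + \gamma\|\mathcal{M}(\mathcal{M}\Psi_k - \mathcal{M}_\eta\Psi_k)\| + \gamma\|\mathcal{M}(\Psi_k - \mathcal{M}\Psi_k)\| \\
&\quad + \gamma\|\mathcal{M}\mathcal{T}_k^{\pi_k}\Psi_k\| \\
&\le \|r\| + \frac{2\gamma\log L}{\eta} + \gamma\|\mathcal{T}_k^{\pi_k}\Psi_k\| \le \|r\| + \frac{2\gamma\log L}{\eta} + \frac{2\gamma^2\log(L)}{\eta(1-\gamma)} + \gamma V_{\max} \\
&\le \frac{2\gamma\log L}{\eta(1-\gamma)} + R_{\max} + \gamma V_{\max} = \frac{2\gamma\log L}{\eta(1-\gamma)} + V_{\max},
\end{aligned}
$$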

where we make use of Lemma 11 to bound the difference between the max operator $\mathcal{M}(\cdot)$ and the soft-max operator $\mathcal{M}_\eta(\cdot)$. Now, by induction, we deduce that for all $k \ge 0$, $\|\mathcal{T}_k^{\pi_k}\Psi_k\| \le 2\gamma\log L/(\eta(1-\gamma)) + V_{\max}$. The bound on $\epsilon_k$ is an immediate consequence of this result. ■
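The stability claim of Lemma 13 can be illustrated with a small simulation. The sketch below rests on several assumptions that are not taken from the paper: a hypothetical random tabular MDP, rewards bounded by `r_max`, $V_{\max} = R_{\max}/(1-\gamma)$, $L$ taken as the number of actions, the soft-max operator implemented as a Boltzmann-weighted average over actions, and an empirical Bellman operator built from one sampled next state per state-action pair. The function names `soft_max` and `run_dpp_rl` are likewise illustrative.

```python
import numpy as np

def soft_max(psi, eta):
    """Soft-max operator applied row-wise: Boltzmann-weighted average over actions."""
    z = eta * (psi - psi.max(axis=1, keepdims=True))  # shift for stability
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)
    return (w * psi).sum(axis=1)

def run_dpp_rl(n_states=5, n_actions=3, gamma=0.9, eta=1.0, r_max=1.0,
               n_iter=2000, seed=0):
    """Run sampled DPP updates; return (largest sup-norm of the empirical
    Bellman operator seen, bound 2*gamma*log(L)/(eta*(1-gamma)) + V_max)."""
    rng = np.random.default_rng(seed)
    # Hypothetical random MDP: kernel P[x, a, :] and rewards in [-r_max, r_max].
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    r = rng.uniform(-r_max, r_max, size=(n_states, n_actions))
    v_max = r_max / (1.0 - gamma)
    bound = 2 * gamma * np.log(n_actions) / (eta * (1 - gamma)) + v_max
    # Initial action preferences bounded by V_max.
    psi = rng.uniform(-v_max, v_max, size=(n_states, n_actions))
    cdf = P.cumsum(axis=2)
    largest = 0.0
    for _ in range(n_iter):
        m_eta = soft_max(psi, eta)
        # One next-state sample y ~ P(.|x, a) per pair (x, a), by inverse CDF.
        u = rng.random((n_states, n_actions, 1))
        y = np.minimum((cdf < u).sum(axis=2), n_states - 1)
        t_k = r + gamma * m_eta[y]            # empirical Bellman operator
        largest = max(largest, float(np.abs(t_k).max()))
        psi = psi + t_k - m_eta[:, None]      # sampled DPP update
    return largest, float(bound)

largest, bound = run_dpp_rl()
print(f"max_k ||T_k Psi_k|| = {largest:.3f}, bound = {bound:.3f}")
```

Note that the action preferences themselves are allowed to drift (the preferences of suboptimal actions typically decrease without bound); it is the empirical Bellman operator applied to them, and hence the estimation error, that stays bounded.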



Now, based on Lemma 13 and Corollary 5, we prove the main result. We begin by recalling the result of Corollary 5:

hmsup HQ* - Q^MI < J™ ll^^ll > 

Thus, to prove the convergence of DPP-RL, we only need to show that $\frac{1}{k+1}\|E_k\|$ asymptotically converges to 0 w.p. 1. For this we rely on the strong law of large numbers for martingale differences (Hoffmann-Jørgensen and Pisier, 1976), which states that the average of a sequence of martingale differences asymptotically converges, almost surely, to 0 if the second moments of all entries of the sequence are bounded by some $0 \le U < \infty$. This is the case for the sequence of martingale differences $\{\epsilon_1, \epsilon_2, \dots\}$, since we have already proven the boundedness of $\|\epsilon_k\|$ in Lemma 13. Thus, we deduce:

$$\lim_{k\to\infty} \frac{1}{k+1}|E_k(x,a)| = 0, \quad \text{w.p. 1}.$$

Thus:

$$\lim_{k\to\infty} \frac{1}{k+1}\|E_k\| = 0, \quad \text{w.p. 1}. \qquad (40)$$

The result then follows by combining (40) with Corollary 5.
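
The averaging effect invoked above can be seen in a toy simulation. This is a sketch under stated assumptions only: the particular bounded martingale difference sequence (a zero-mean uniform draw whose scale depends on the past partial sum) and the function name `averaged_error` are hypothetical choices made for illustration, not the estimation errors of DPP-RL.

```python
import numpy as np

def averaged_error(n_steps, bound=1.0, seed=0):
    """Return |E_k| / (k + 1) for k = 0..n_steps-1, where E_k is the
    running sum of a bounded martingale difference sequence."""
    rng = np.random.default_rng(seed)
    total = 0.0
    averages = np.empty(n_steps)
    for k in range(n_steps):
        # The scale depends on the past, so the terms are not i.i.d.,
        # yet each has zero conditional mean and magnitude at most `bound`.
        scale = bound * (0.5 + 0.5 * abs(np.tanh(total)))
        eps = scale * rng.uniform(-1.0, 1.0)
        total += eps
        averages[k] = abs(total) / (k + 1)
    return averages

avg = averaged_error(20000)
for k in (10, 100, 1000, 10000, 20000):
    print(f"k = {k:6d}   |E_k| / (k + 1) = {avg[k - 1]:.5f}")
```

On a single run the averaged error need not decrease monotonically; the strong law only guarantees that it tends to 0 almost surely as $k$ grows.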